5420 Anomaly Detection | Assignment 7 - Joyce Ng (jn2901)¶

The purpose of conducting anomaly analysis on healthcare/inpatient data is to identify and understand unusual patterns or outliers that could indicate potential issues or opportunities for improvement in healthcare delivery, cost management, and patient outcomes. Objective of this project is to perform exhaustive EDA on inpatient data and provide business insights of each feature. Then utilize Autoencoder and iForest to detect anomalies.¶

Desciption of dataset:

DRG Definition:

Diagnosis Related Group. Classification system that groups similar clinical conditions (diagnoses) and the procedures furnished by the hospital during the stay.

Total Discharges:

The number of discharges billed by all providers for inpatient hospital services.

Average Covered Charges:

= Total Covered Charge Amount / Total Discharges. The average charge of all provider's services covered by Medicare for discharges in the DRG. These will vary from hospital to hospital because of differences in hospital charge structures.

Average Total Payment:

= Total Payments / Total Discharges. The average total payments of what Medicare actually pays to the provider as well as co-payment and deductible amounts that the beneficiary is responsible for and payments by third parties for coordination of benefits.

Average Medicare Payment:

= Medicare Payment Amount / Total Discharges. The average amount that Medicare pays to the provider. Medicare payment amounts include the MS-DRG amount, teaching, disproportionate share, capital, and outlier payments for all cases. Medicare payments DO NOT include beneficiary co-payments and deductible amounts nor any additional payments from third parties for coordination of benefits.

Table of Contents¶

  • Section 1: Data Preparation
    • 1.1 Load Libraries and Dataset
    • 1.2 Check Missing Values & Change Columns Names and Data Type
  • Section 2: EDA
    • 2.1 Distribution of Variables
      • 2.1.1 Distributions of Avg. Payments
      • 2.1.2 Distributions of Charges and Payments
      • 2.1.3 DRG Frequency
      • 2.1.4 Distributions of Charges and Payments by State
  • Section 3: Feature Engineering
    • 3.1 Average Medicare Payments
      • 3.1.1 By DRG
      • 3.1.2 Per DRG by State
      • 3.1.3 Per DRG by State & City
      • 3.1.4 Per DRG by Zip Code
    • 3.2 Average Total Payments
      • 3.2.1 Per DRG by State
      • 3.2.2 Per DRG by State & City
      • 3.2.3 Per DRG by Zip Code
    • 3.3 Average Out-Of-Pocket Payments
      • 3.3.1 Per DRG by State
      • 3.3.2 Per DRG by State & City
      • 3.3.3 Per DRG by Zip Code
  • Section 4: Autoencoder
    • 4.1 Model 1
    • 4.2 Model 2
    • 4.3 Model 3
    • 4.4 Aggregate Multiple Models
    • 4.5 Combination by Average
      • 4.5.1 Outliers List
      • 4.5.2 Top 10 Outliers by DRG
      • 4.5.3 Outliers by State
    • 4.6 Combination by Max
  • Section 5: iForest
    • 5.1 iForest
    • 5.2 Combination by Average
      • 5.2.1 Outliers List
      • 5.2.2 Top 10 Outliers by DRG
      • 5.2.3 Outliers by State
  • Section 6: Model Comparison

     

Section 1: Data Preparation ¶

1.1: Load Libraries and Dataset ¶

In [1]:
# Load libraries
import pandas as pd
import numpy as np
from functools import reduce
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.metrics import roc_curve,roc_auc_score, confusion_matrix, f1_score, accuracy_score, make_scorer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from pyod.models.combination import aom, moa, average, maximization
from pyod.utils.utility import standardizer
from sklearn import metrics
from pyod.models.knn import KNN
from pyod.models.auto_encoder import AutoEncoder
from pyod.models.iforest import IForest

# Visiualization
import plotly.express as px
import plotly.graph_objs as go
import plotly.subplots as sp
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import plotly.io as pio
from IPython.display import display
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
import matplotlib.pyplot as plt
import seaborn as sns

import warnings 
warnings.filterwarnings("ignore")
In [2]:
data = pd.read_csv('/Users/Joyce630/Desktop/Columbia/5420 Anomaly Detection/Assignments/4 - Healthcare Fraud/Data/inpatientCharges.csv')
data.head()
Out[2]:
DRG Definition Provider Id Provider Name Provider Street Address Provider City Provider State Provider Zip Code Hospital Referral Region Description Total Discharges Average Covered Charges Average Total Payments Average Medicare Payments
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 $32963.07 $5777.24 $4763.73
1 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10005 MARSHALL MEDICAL CENTER SOUTH 2505 U S HIGHWAY 431 NORTH BOAZ AL 35957 AL - Birmingham 14 $15131.85 $5787.57 $4976.71
2 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10006 ELIZA COFFEE MEMORIAL HOSPITAL 205 MARENGO STREET FLORENCE AL 35631 AL - Birmingham 24 $37560.37 $5434.95 $4453.79
3 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10011 ST VINCENT'S EAST 50 MEDICAL PARK EAST DRIVE BIRMINGHAM AL 35235 AL - Birmingham 25 $13998.28 $5417.56 $4129.16
4 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10016 SHELBY BAPTIST MEDICAL CENTER 1000 FIRST STREET NORTH ALABASTER AL 35007 AL - Birmingham 18 $31633.27 $5658.33 $4851.44
In [3]:
print(data.shape, "\n") # Check the shape of df
print(data.columns, "\n") # Check column names
print(data.info(), "\n") # Check info of the df
(163065, 12) 

Index(['DRG Definition', 'Provider Id', 'Provider Name',
       'Provider Street Address', 'Provider City', 'Provider State',
       'Provider Zip Code', 'Hospital Referral Region Description',
       ' Total Discharges ', ' Average Covered Charges ',
       ' Average Total Payments ', 'Average Medicare Payments'],
      dtype='object') 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163065 entries, 0 to 163064
Data columns (total 12 columns):
 #   Column                                Non-Null Count   Dtype 
---  ------                                --------------   ----- 
 0   DRG Definition                        163065 non-null  object
 1   Provider Id                           163065 non-null  int64 
 2   Provider Name                         163065 non-null  object
 3   Provider Street Address               163065 non-null  object
 4   Provider City                         163065 non-null  object
 5   Provider State                        163065 non-null  object
 6   Provider Zip Code                     163065 non-null  int64 
 7   Hospital Referral Region Description  163065 non-null  object
 8    Total Discharges                     163065 non-null  int64 
 9    Average Covered Charges              163065 non-null  object
 10   Average Total Payments               163065 non-null  object
 11  Average Medicare Payments             163065 non-null  object
dtypes: int64(3), object(9)
memory usage: 14.9+ MB
None 

1.2: Check Missing Values & Change Columns Names and Data Type ¶

In [4]:
# Check for missing values
missing_values = data.isnull().sum()
missing_values
Out[4]:
DRG Definition                          0
Provider Id                             0
Provider Name                           0
Provider Street Address                 0
Provider City                           0
Provider State                          0
Provider Zip Code                       0
Hospital Referral Region Description    0
 Total Discharges                       0
 Average Covered Charges                0
 Average Total Payments                 0
Average Medicare Payments               0
dtype: int64
In [5]:
# remove white space in 
data = data.rename(columns=lambda x: x.strip())
In [6]:
# Change data type 
data = data.rename(columns=lambda x: x.strip())
data['DRG Definition'] = data['DRG Definition'].astype('category')
data['Provider State'] = data['Provider State'].astype('category')
data['Provider City'] = data['Provider City'].astype('category')
data['Provider Id'] = data['Provider Id'].astype('category')
                   
# Remove $ sign from columns
data['Average Covered Charges'] = data['Average Covered Charges'].str.strip("$").astype((float))
data['Average Total Payments'] = data['Average Total Payments'].str.strip("$").astype((float))
data['Average Medicare Payments'] = data['Average Medicare Payments'].str.strip("$").astype((float))
In [7]:
# Remove white space in columns
data.columns = ['DRG','Provider_Id', 'Provider_Name','Provider_StreetAddress','Provider_City',
               'Provider_State','Provider_Zipcode','Hospital_referral_region_desp',
                'Total_Discharges','Average_Covered_Charges','Average_Total_Payments',
                'Average_Medicare_Payments']
In [8]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163065 entries, 0 to 163064
Data columns (total 12 columns):
 #   Column                         Non-Null Count   Dtype   
---  ------                         --------------   -----   
 0   DRG                            163065 non-null  category
 1   Provider_Id                    163065 non-null  category
 2   Provider_Name                  163065 non-null  object  
 3   Provider_StreetAddress         163065 non-null  object  
 4   Provider_City                  163065 non-null  category
 5   Provider_State                 163065 non-null  category
 6   Provider_Zipcode               163065 non-null  int64   
 7   Hospital_referral_region_desp  163065 non-null  object  
 8   Total_Discharges               163065 non-null  int64   
 9   Average_Covered_Charges        163065 non-null  float64 
 10  Average_Total_Payments         163065 non-null  float64 
 11  Average_Medicare_Payments      163065 non-null  float64 
dtypes: category(4), float64(3), int64(2), object(3)
memory usage: 11.1+ MB
In [9]:
data.head()
Out[9]:
DRG Provider_Id Provider_Name Provider_StreetAddress Provider_City Provider_State Provider_Zipcode Hospital_referral_region_desp Total_Discharges Average_Covered_Charges Average_Total_Payments Average_Medicare_Payments
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 32963.07 5777.24 4763.73
1 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10005 MARSHALL MEDICAL CENTER SOUTH 2505 U S HIGHWAY 431 NORTH BOAZ AL 35957 AL - Birmingham 14 15131.85 5787.57 4976.71
2 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10006 ELIZA COFFEE MEMORIAL HOSPITAL 205 MARENGO STREET FLORENCE AL 35631 AL - Birmingham 24 37560.37 5434.95 4453.79
3 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10011 ST VINCENT'S EAST 50 MEDICAL PARK EAST DRIVE BIRMINGHAM AL 35235 AL - Birmingham 25 13998.28 5417.56 4129.16
4 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10016 SHELBY BAPTIST MEDICAL CENTER 1000 FIRST STREET NORTH ALABASTER AL 35007 AL - Birmingham 18 31633.27 5658.33 4851.44

Section 2: EDA ¶

2.1: Distribution of Variables ¶

In [10]:
unique_id =  len(data.groupby('Provider_Id').count())
unique_providers = len(data.groupby('Provider_Name').count())
unique_states = len(data.groupby('Provider_State').count())
unique_cities = len(data.groupby('Provider_City').count())
unique_drg = len(data.groupby('DRG').count())

print(f'Unique ID = {unique_id}\n')
print(f'Unique Providers = {unique_providers}\n')
print(f'Unique States = {unique_states}\n')
print(f'Unique Cities = {unique_cities}\n')
print(f'Unique DRG = {unique_drg}\n')
Unique ID = 3337

Unique Providers = 3201

Unique States = 51

Unique Cities = 1977

Unique DRG = 100

In [11]:
data.describe()
Out[11]:
Provider_Zipcode Total_Discharges Average_Covered_Charges Average_Total_Payments Average_Medicare_Payments
count 163065.000000 163065.000000 163065.000000 163065.000000 163065.000000
mean 47938.121908 42.776304 36133.954224 9707.473804 8494.490964
std 27854.323080 51.104042 35065.365931 7664.642598 7309.467261
min 1040.000000 11.000000 2459.400000 2673.000000 1148.900000
25% 27261.000000 17.000000 15947.160000 5234.500000 4192.350000
50% 44309.000000 27.000000 25245.820000 7214.100000 6158.460000
75% 72901.000000 49.000000 43232.590000 11286.400000 10056.880000
max 99835.000000 3383.000000 929118.900000 156158.180000 154620.810000

2.1 Distribution of Avg. Payments ¶

In [12]:
# Create a figure
fig = go.Figure()

# Add histogram for Average Total Payments
fig.add_trace(go.Histogram(x=data['Average_Total_Payments'], name='Average_Total_Payments', opacity=1))
fig.add_trace(go.Histogram(x=data['Average_Medicare_Payments'], name='Average_Medicare_Payments', opacity=0.85))

# Update layout
fig.update_layout(
    title_text='Distribution of Average Total & Medicare Payments',
    xaxis_title_text='Payments ($)',
    yaxis_title_text='Frequency',
    height=800, width=950,
    template = 'seaborn',
    plot_bgcolor = '#F6F5F5'
)

# Show plot
fig.show()

Based on the graph, it's notable that the average Medicare payment exceeds the overall average payment, which is unusual and potentially concerning. Therefore, further investigation is necessary to identify and understand these discrepancies.

2.1.2 Distributions of Charges and Payments by States ¶

In [13]:
sns.pairplot(data[['Provider_State','Average_Total_Payments','Total_Discharges','Average_Medicare_Payments','Average_Covered_Charges']], 
             hue= 'Provider_State')
Out[13]:
<seaborn.axisgrid.PairGrid at 0x287eebe10>
No description has been provided for this image

The above graph show the distributions of different charges and payments on each states. There are already some outliers showed on the graph which need further investigations.

2.1.3 DRG Frequency ¶

Identify the most common DRGs by grouping the DRG and total number of discharges, then sort in an descending order.

In [14]:
common_drg = data.groupby('DRG')
common_drg.agg({'Total_Discharges': np.sum}).sort_values(by=['Total_Discharges'], ascending=0)[:20]
Out[14]:
Total_Discharges
DRG
470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT OF LOWER EXTREMITY W/O MCC 427207
871 - SEPTICEMIA OR SEVERE SEPSIS W/O MV 96+ HOURS W MCC 319072
392 - ESOPHAGITIS, GASTROENT & MISC DIGEST DISORDERS W/O MCC 244854
292 - HEART FAILURE & SHOCK W CC 222038
690 - KIDNEY & URINARY TRACT INFECTIONS W/O MCC 206695
194 - SIMPLE PNEUMONIA & PLEURISY W CC 198390
291 - HEART FAILURE & SHOCK W MCC 185599
641 - MISC DISORDERS OF NUTRITION,METABOLISM,FLUIDS/ELECTROLYTES W/O MCC 153660
683 - RENAL FAILURE W CC 150444
190 - CHRONIC OBSTRUCTIVE PULMONARY DISEASE W MCC 149677
191 - CHRONIC OBSTRUCTIVE PULMONARY DISEASE W CC 148491
312 - SYNCOPE & COLLAPSE 141918
603 - CELLULITIS W/O MCC 140894
378 - G.I. HEMORRHAGE W CC 138678
313 - CHEST PAIN 131079
193 - SIMPLE PNEUMONIA & PLEURISY W MCC 127989
287 - CIRCULATORY DISORDERS EXCEPT AMI, W CARD CATH W/O MCC 115920
192 - CHRONIC OBSTRUCTIVE PULMONARY DISEASE W/O CC/MCC 114790
310 - CARDIAC ARRHYTHMIA & CONDUCTION DISORDERS W/O CC/MCC 113454
872 - SEPTICEMIA OR SEVERE SEPSIS W/O MV 96+ HOURS W/O MCC 112430
In [15]:
drg470 = data[data['DRG'].str.contains('470')]

The most common procedures finsihed by the hospital was 470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT OF LOWER EXTREMITY W/O MCC, In the following I created a scatter plot to take a deeper look at the average medicare payment of this group.

In [16]:
# Create scatter plot using Plotly Express with color mapped to 'Average_Covered_Charges'
fig = px.scatter(drg470, x='Provider_State', y='Average_Medicare_Payments',
                 color='Average_Medicare_Payments', 
                 color_continuous_scale='PuBuGn',
                 title='Average Medicare Payments for DRG 470 by State',
                 labels={'Provider_State': 'State', 'Average_Medicare_Payments': 'Average Medicare Payments'})

# Customize layout
fig.update_layout(
    xaxis_title='State', 
    yaxis_title='Average Medicare Payments',
    template='plotly_white'
)

# Show the figure
fig.show()

For DRG 470, the average medicare payments seems to be within 10k to 20k, with some states i.e. Maryland, California, New York over 25k or even above 30k.

Next, I'll take a look at how much each procedures had cost by each state.

In [17]:
def stat(data,group_var,cont_var):
    coln =list(data[[group_var,cont_var]].describe().index)
    row = data[[group_var]].drop_duplicates().reset_index(drop=True).values.tolist()
    row = [item for sublist in row for item in sublist]
    desc_cont_var = pd.DataFrame(data[[group_var,cont_var]].groupby(group_var).describe().values.tolist(), columns=coln,index=row)
    return desc_cont_var
In [18]:
stat(data,'DRG','Average_Total_Payments').head(n=10)
Out[18]:
count mean std min 25% 50% 75% max
039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 1079.0 6960.534004 1477.873952 4968.00 6001.8300 6582.890 7516.8250 18420.56
057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/O MCC 1201.0 6706.276445 2033.965862 4194.09 5412.8700 6093.750 7345.3600 25519.43
069 - TRANSIENT ISCHEMIA 1659.0 13263.823032 3847.918207 8174.28 10762.2200 12084.700 14424.3250 50882.40
064 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W MCC 2269.0 7922.671141 2084.658336 5368.73 6626.2700 7280.050 8503.0600 26510.15
065 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W CC 1806.0 5713.985221 1342.538675 3916.41 4819.3250 5326.025 6197.4800 14744.05
066 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W/O CC/MCC 1962.0 5068.644292 1114.407934 3514.41 4320.7375 4740.235 5459.7375 12312.80
074 - CRANIAL & PERIPHERAL NERVE DISORDERS W/O MCC 979.0 6386.793177 1597.733831 4344.17 5278.4200 5903.250 7060.7800 15759.14
101 - SEIZURES W/O MCC 1593.0 5493.361883 1420.448742 3704.89 4530.4100 5058.160 6013.8600 15554.42
149 - DYSEQUILIBRIUM 988.0 4589.104342 1082.833515 3185.77 3826.4325 4285.315 5036.1925 9325.21
176 - PULMONARY EMBOLISM W/O MCC 1396.0 7279.954979 1544.880403 4891.30 6238.9800 6848.800 7933.6125 15783.05
In [19]:
stat(data,'DRG','Average_Medicare_Payments').head(n=10)
Out[19]:
count mean std min 25% 50% 75% max
039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 1079.0 5555.837525 1236.063439 3592.85 4759.8000 5269.280 6033.6100 15855.18
057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/O MCC 1201.0 5701.676570 1973.620083 3116.42 4467.6600 5104.780 6293.6000 22873.49
069 - TRANSIENT ISCHEMIA 1659.0 12112.849445 3702.047762 6603.63 9765.6300 10956.160 13094.6300 48632.28
064 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W MCC 2269.0 6781.501785 1979.482762 4109.25 5555.9100 6167.510 7342.0300 23402.26
065 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W CC 1806.0 4469.203560 1271.852222 2771.88 3653.9750 4079.575 4862.4650 13710.23
066 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W/O CC/MCC 1962.0 3952.275102 1094.682773 2348.07 3240.5050 3627.465 4298.7650 9959.58
074 - CRANIAL & PERIPHERAL NERVE DISORDERS W/O MCC 979.0 5238.579438 1542.402076 2456.83 4201.5650 4815.660 5862.4700 13310.28
101 - SEIZURES W/O MCC 1593.0 4452.600709 1379.483972 2629.69 3535.5800 4012.660 4966.9500 13961.18
149 - DYSEQUILIBRIUM 988.0 3440.677065 1076.033108 2012.35 2717.5875 3145.555 3805.2575 8615.31
176 - PULMONARY EMBOLISM W/O MCC 1396.0 6077.948660 1505.418988 3561.68 5122.6200 5644.930 6554.8175 15161.27

2.1.4 Choropleth Map ¶

In [20]:
def plot_map(data, group_var, cont_var, stat='mean', color='RdBu_r', label=''):
    # Grouping the Data
    if stat == 'count':
        x = data.groupby(group_var).size()  # Count number of rows per group
    else:
        x = data.groupby(group_var)[cont_var].agg(stat)  # Compute specified statistic
    
  # Plotting Heat Map
    fig = go.Figure(go.Choropleth(
        locations=list(x.index), 
        z=x.astype(float),       
        locationmode='USA-states',
        colorscale=color,
        colorbar_title=f'{stat.capitalize()} {label.lower()}',
    ))
    
    fig.update_layout(
        title_text=f'{cont_var.replace("_", " ").title()} Per State',
        geo_scope='usa',
        height=800, width=950
    )
    
    return fig
In [21]:
# Plot the distribution of variables by states using choropleth map
p = [
    plot_map(data, "Provider_State", "Total_Discharges", stat='count'),
    plot_map(data, "Provider_State", "Average_Covered_Charges", stat='mean', label="Covered Charges"),
    plot_map(data, "Provider_State", "Average_Total_Payments", stat='mean', label="Total Payments"),
    plot_map(data, "Provider_State", "Average_Medicare_Payments", stat='mean', label="Medicare Payments")
]

for fig in p:
    fig.show()
     

Section 3: Feature Engineering ¶

3.1 Avgerage Medicare Payments ¶

For this section, benchmarks are calculated by grouping the DRGs, provider states, cities and zip code from the dataset, then take the mean of the Average medicare payments.

Four features are created under this section which are Average Medicare Payments by DRG, States, Cities & Zip Code. They can provides insights into healthcare payment patterns across various levels of granularity.

3.1.1 By DRG ¶

In [22]:
table_1 = data.groupby(['DRG'])['Average_Medicare_Payments'].mean().reset_index()
table_1 = table_1.rename(columns={'Average_Medicare_Payments': 'Average_Medicare_Payments_byDRG'})
table_1.head(10)
Out[22]:
DRG Average_Medicare_Payments_byDRG
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 5555.837525
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 5701.676570
2 064 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... 12112.849445
3 065 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... 6781.501785
4 066 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... 4469.203560
5 069 - TRANSIENT ISCHEMIA 3952.275102
6 074 - CRANIAL & PERIPHERAL NERVE DISORDERS W/O... 5238.579438
7 101 - SEIZURES W/O MCC 4452.600709
8 149 - DYSEQUILIBRIUM 3440.677065
9 176 - PULMONARY EMBOLISM W/O MCC 6077.948660

To better identify outliers, additional quantiles i.e 90%, 95% are added by using the following function:¶

In [23]:
def calculate_descriptive_stats(data, column):

    '''
    This function calculate descriptive statistics with additional quantile.
    '''
    
    # Calculate basic descriptive statistics
    desc_stats = data[column].describe()

    # Calculate the 90th and 95th percentiles
    perc_90 = data[column].quantile(0.90)
    perc_95 = data[column].quantile(0.95)

    # Convert the descriptive statistics to a DataFrame
    desc_stats = desc_stats.to_frame().transpose()

    # Add the 90th and 95th percentiles to the DataFrame
    desc_stats['90%'] = perc_90
    desc_stats['95%'] = perc_95

    # Reorder the columns to match the desired format
    desc_stats = desc_stats[['mean', 'min', '25%', '50%', '75%', '90%', '95%', 'max']]

    # Transpose the DataFrame
    desc_stats = desc_stats.T

    return desc_stats
In [24]:
table_1_stats = calculate_descriptive_stats(table_1, 'Average_Medicare_Payments_byDRG')
table_1_stats
Out[24]:
Average_Medicare_Payments_byDRG
mean 9248.575795
min 2876.015593
25% 4465.052848
50% 6391.828157
75% 11444.865846
90% 17047.411069
95% 21501.661996
max 41899.432929

Based on the above table, payments that exceeded the 95% threshold should be considered as anomalies.

In [25]:
# Add average medicare payments by states to the dataset 
df1 = pd.merge(data, table_1, how='left', on=['DRG'])

# Calculate the ratio of the Average medicare payments to the Average medicare payments by DRG
df1['Average_Medicare_byDRG_Ratio'] = np.where(df1['Average_Medicare_Payments_byDRG']==0,0, df1['Average_Medicare_Payments'] / df1['Average_Medicare_Payments_byDRG'])

Distribution of Average Medicare Payments By DRG Ratio¶

In [26]:
# Define variables and quantile range
var = 'Average_Medicare_byDRG_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df1[binned_var] = pd.qcut(df1[var], percentile)

# Create the bin labels
df1['bin_label'] = df1[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df1['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)

# Update the layout
fig.update_layout(
    title='Distribution of Average Medicare Payments By DRG Ratio',
    xaxis_title='Quantile Range',
    yaxis_title='Count',
    template = 'simple_white'
)

# Show the plot
fig.show()

This feature focuses on understanding how Medicare payments vary across different DRGs. It helps in identifying anomalies such as DRGs with unusually high or low Medicare payments compared to the overall average.

3.1.2 Per DRG by State ¶

In [27]:
table_2 = data.groupby(['DRG', 'Provider_State'])['Average_Medicare_Payments'].mean().reset_index()
table_2.columns = ['DRG', 'Provider_State', 'Average_Medicare_Bystates']
table_2.head(10)
Out[27]:
DRG Provider_State Average_Medicare_Bystates
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC AK 6413.780000
1 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC AL 4599.593043
2 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC AR 4938.712500
3 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC AZ 5912.832917
4 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC CA 6878.954925
5 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC CO 5419.930000
6 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC CT 6709.101333
7 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC DC 8110.816667
8 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC DE 5719.316667
9 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC FL 4868.594568
In [28]:
table_2_stats = calculate_descriptive_stats(table_2, 'Average_Medicare_Bystates')
table_2_stats
Out[28]:
Average_Medicare_Bystates
mean 9289.670980
min 2216.372500
25% 4483.274762
50% 6635.284161
75% 11044.008750
90% 18103.081589
95% 24208.021478
max 63646.045000
In [29]:
# Add average medicare payments by states to the dataset 
df2 = pd.merge(data, table_2, how='left', on=['DRG', 'Provider_State'])

# Calculate the ratio of the Average medicare payments to the Average medicare payments by states
df2['Average_Medicare_Bystates_Ratio'] = np.where(df2['Average_Medicare_Bystates']==0,0, df2['Average_Medicare_Payments'] / df2['Average_Medicare_Bystates'])
df2['Average_Medicare_Bystates_Ratio'].describe()
Out[29]:
count    163065.000000
mean          1.000000
std           0.215091
min           0.339831
25%           0.868395
50%           0.954250
75%           1.071483
max           5.174415
Name: Average_Medicare_Bystates_Ratio, dtype: float64

Distribution of Average Medicare Payments Per DRG By State Ratio¶

In [30]:
# Define variables and quantile range
var = 'Average_Medicare_Bystates_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df2[binned_var] = pd.qcut(df2[var], percentile)

# Create the bin labels
df2['bin_label'] = df2[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df2['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)

# Update the layout
fig.update_layout(
    title='Distribution of Average Medicare Payments State Ratio',
    xaxis_title='Quantile Range',
    yaxis_title='Count',
    template = 'simple_white'
)

# Show the plot
fig.show()

This feature adds state-level granularity to the analysis allows for identifying regional disparities in Medicare payments for each DRG. It helps understand variations in healthcare costs influenced by state-level policies, demographics, and healthcare infrastructure. Payments fall witin the higher range i.e. [1.74-5.17] could be flagged as anomalies.

3.1.3 Per DRG by State & City ¶

In [31]:
table_3 = data.groupby(['DRG', 'Provider_State', 'Provider_City'])['Average_Medicare_Payments'].mean().reset_index()
table_3 = table_3.rename(columns={'Average_Medicare_Payments': 'Average_Medicare_Bystatescity'}).sort_values(by='Average_Medicare_Bystatescity', ascending=False)

table_3.head(10)
Out[31]:
DRG Provider_State Provider_City Average_Medicare_Bystatescity
2126950 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... CA STANFORD 154620.81
9244283 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... NY VALHALLA 133177.26
2125896 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... CA FREMONT 113462.09
9285667 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS CA STANFORD 109303.21
9345110 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS NY VALHALLA 107456.18
5312030 329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC NY VALHALLA 100040.26
6831303 469 - MAJOR JOINT REPLACEMENT OR REATTACHMENT ... PA HERSHEY 99114.86
2186393 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... NY VALHALLA 98792.46
9184840 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... CA STANFORD 94614.03
9361733 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS TX GALVESTON 89857.83
In [32]:
table_3_stats = calculate_descriptive_stats(table_3, 'Average_Medicare_Bystatescity')
table_3_stats
Out[32]:
Average_Medicare_Bystatescity
mean 8133.157995
min 1148.900000
25% 4070.555000
50% 5925.720000
75% 9701.325000
90% 14810.964000
95% 20455.067000
max 154620.810000
In [33]:
# Add average medicare payments by states to the dataset 
df3 = pd.merge(data, table_3, how='left', on=['DRG', 'Provider_State', 'Provider_City'])

# Calculate the ratio of the Average medicare payments to the Average medicare payments by states
df3['Average_Medicare_Bystatescity_Ratio'] = np.where(df3['Average_Medicare_Bystatescity']==0,0, df3['Average_Medicare_Payments'] / df3['Average_Medicare_Bystatescity'])

Distribution of Average Medicare Payments By State and City Ratio¶

In [34]:
# Define variables and quantile range
var = 'Average_Medicare_Bystatescity_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df3[binned_var] = pd.qcut(df3[var], percentile, duplicates='drop')

# Create the bin labels
df3['bin_label'] = df3[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df3['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)

# Update the layout
fig.update_layout(
    title='Distribution of Average Medicare Payments By State and City Ratio',
    xaxis_title='Quantile Range',
    yaxis_title='Count',
    template = 'simple_white'
)

# Show the plot
fig.show()
In [35]:
# Calculate 95th percentile threshold
threshold = np.percentile(df3[var], 95)

# Create anomaly indicator
df3['Anomaly'] = np.where(df3[var] > threshold, 1, 0)

# Count anomalies
anomaly_counts = df3['Anomaly'].value_counts()
anomaly_counts
Out[35]:
Anomaly
0    154911
1      8154
Name: count, dtype: int64

This feature further disaggregating by city within each state provides insights into localized healthcare payment patterns. It identifies cities where Medicare payments differ markedly across various DRGs, indicating potential anomalies in healthcare service delivery or cost structures at the city level.

3.1.4 Avgerage Medicare Payments per DRG by Zip Code ¶

In [36]:
table_4 = data.groupby(['DRG', 'Provider_Zipcode'])['Average_Medicare_Payments'].mean().reset_index()
table_4 = table_4.rename(columns={'Average_Medicare_Payments': 'Average_Medicare_Byzip'}).sort_values(by='Average_Medicare_Byzip', ascending=False)

table_4.head(10)
Out[36]:
DRG Provider_Zipcode Average_Medicare_Byzip
66990 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... 94305 154620.81
278041 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... 10595 133177.26
66996 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... 94538 113462.09
283753 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS 94305 109303.21
281094 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS 10595 107456.18
283585 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS 90095 101717.82
158974 329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC 10595 100040.26
204951 469 - MAJOR JOINT REPLACEMENT OR REATTACHMENT ... 17033 99114.86
64331 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... 10595 98792.46
278317 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... 20060 95701.42
In [37]:
table_4_stats = calculate_descriptive_stats(table_4, 'Average_Medicare_Byzip')
table_4_stats
Out[37]:
Average_Medicare_Byzip
mean 8455.257017
min 1148.900000
25% 4172.160000
50% 6125.210000
75% 10014.280000
90% 15625.386000
95% 21653.620000
max 154620.810000
In [38]:
# Add average medicare payments by states to the dataset 
df4 = pd.merge(data, table_4, how='left', on=['DRG', 'Provider_Zipcode'])

# Calculate the ratio of the Average medicare payments to the Average medicare payments by states
df4['Average_Medicare_Byzip_Ratio'] = np.where(df4['Average_Medicare_Byzip']==0,0, df4['Average_Medicare_Payments'] / df4['Average_Medicare_Byzip'])

Distribution of Average Medicare Payments By Zip Code Ratio¶

In [39]:
# Define variables and quantile range
var = 'Average_Medicare_Byzip_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df4[binned_var] = pd.qcut(df4[var], percentile, duplicates='drop')

# Create the bin labels
df4['bin_label'] = df4[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df4['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)

# Update the layout
fig.update_layout(
    title='Distribution of Average Medicare Payments By Zip Code Ratio',
    xaxis_title='Quantile Range',
    yaxis_title='Count',
    template = 'simple_white'
)

# Show the plot
fig.show()
In [40]:
# Calculate 95th percentile threshold
threshold = np.percentile(df4[var], 95)

# Create anomaly indicator
df4['Anomaly'] = np.where(df4[var] > threshold, 1, 0)

# Count anomalies
anomaly_counts = df4['Anomaly'].value_counts()
anomaly_counts
Out[40]:
Anomaly
0    155070
1      7995
Name: count, dtype: int64

The count of anomaly will mainly performed on 'by state & city' and 'by zip code' because those two levels provide more indepth insights.

This feature add zip code-level granularity offers the most detailed view of Medicare payment variations within specific cities. It identifies zip codes where Medicare payments deviate significantly within the same city and state, pinpointing localized anomalies or disparities in healthcare utilization and costs.

3.2 Average Total Payments ¶

For this section, benchmarks are calculated by grouping the DRGs, provider states, cities and zip code from the dataset, then take the mean of the Average Total payments.

Three features are created under this section which are Average Total Payments by DRG, States, Cities & Zip Code. They can provides comprehensive insights into healthcare expenditure patterns across various levels of granularity.

3.2.1 Per DRG by State ¶

In [41]:
table_5 = data.groupby(['DRG', 'Provider_State'])['Average_Total_Payments'].mean().reset_index()
table_5 = table_5.rename(columns={'Average_Total_Payments': 'Average_Total_Bystate'}).sort_values(by='Average_Total_Bystate', ascending=False)

table_5.head(10)
Out[41]:
DRG Provider_State Average_Total_Bystate
2652 329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC AK 68006.425000
4648 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... DC 66437.692500
4692 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS AK 61300.630000
4641 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... AK 57238.680000
4652 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... HI 56482.922000
4726 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS NY 56465.890641
1071 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... AK 54385.475000
4696 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS CA 53733.333710
4738 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS VT 52761.330000
4699 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS DC 52491.154000
In [42]:
table_5_stats = calculate_descriptive_stats(table_5, 'Average_Total_Bystate')
table_5_stats
Out[42]:
Average_Total_Bystate
mean 10640.517669
min 3236.026333
25% 5583.888889
50% 7780.726250
75% 12493.163000
90% 19845.450487
95% 27083.356325
max 68006.425000
In [43]:
# Add average medicare payments by states to the dataset 
df5 = pd.merge(data, table_5, how='left', on=['DRG', 'Provider_State'])

# Calculate the ratio of the Average medicare payments to the Average medicare payments by states
df5['Average_Total_Bystates_Ratio'] = np.where(df5['Average_Total_Bystate']==0,0, df5['Average_Total_Payments'] / df5['Average_Total_Bystate'])

Distribution of Average Total Payments By State Ratio¶

In [44]:
# Define variables and quantile range
var = 'Average_Total_Bystates_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df5[binned_var] = pd.qcut(df5[var], percentile)

# Create the bin labels
df5['bin_label'] = df5[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df5['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.dense)

# Update the layout
fig.update_layout(
    title='Distribution of Average Total Payments By State Ratio',
    xaxis_title='Quantile Range',
    yaxis_title='Count',
    template = 'simple_white'
)

# Show the plot
fig.show()

This feature adds state-level granularity to the analysis identifies regional variations in total payments for each DRG.

3.2.2 Per DRG by State & City ¶

In [45]:
table_6 = data.groupby(['DRG', 'Provider_State', 'Provider_City'])['Average_Total_Payments'].mean().reset_index()
table_6 = table_6.rename(columns={'Average_Total_Payments': 'Average_Total_Bystatescity'}).sort_values(by='Average_Total_Bystatescity', ascending=False)

table_6.head(10)
Out[45]:
DRG Provider_State Provider_City Average_Total_Bystatescity
2126950 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... CA STANFORD 156158.18
9244283 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... NY VALHALLA 140255.26
2125896 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... CA FREMONT 119113.00
9345110 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS NY VALHALLA 119028.90
9285667 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS CA STANFORD 109945.57
5312030 329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC NY VALHALLA 101796.26
6831303 469 - MAJOR JOINT REPLACEMENT OR REATTACHMENT ... PA HERSHEY 100018.33
2186393 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... NY VALHALLA 99929.46
9184840 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... CA STANFORD 95302.74
9361733 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS TX GALVESTON 90701.58
In [46]:
table_6_stats = calculate_descriptive_stats(table_6, 'Average_Total_Bystatescity')
table_6_stats
Out[46]:
Average_Total_Bystatescity
mean 9316.580074
min 2673.000000
25% 5105.865000
50% 6969.000000
75% 10860.125000
90% 16328.638000
95% 22435.810500
max 156158.180000
In [47]:
# Add average medicare payments by states to the dataset 
df6 = pd.merge(data, table_6, how='left', on=['DRG', 'Provider_State', 'Provider_City'])

# Calculate the ratio of the Average medicare payments to the Average medicare payments by states
df6['Average_Total_Bystatescity_Ratio'] = np.where(df6['Average_Total_Bystatescity']==0,0, df6['Average_Total_Payments'] / df6['Average_Total_Bystatescity'])

Distribution of Average Total Payments By State and City Ratio¶

In [48]:
# Define variables and quantile range
var = 'Average_Total_Bystatescity_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df6[binned_var] = pd.qcut(df6[var], percentile, duplicates='drop')

# Create the bin labels
df6['bin_label'] = df6[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df6['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.dense)

# Update the layout
fig.update_layout(
    title='Distribution of Average Total Payments By State and City Ratio',
    xaxis_title='Quantile Range',
    yaxis_title='Count',
    template = 'simple_white'
)

# Show the plot
fig.show()
In [49]:
# Calculate 95th percentile threshold
threshold = np.percentile(df6[var], 95)

# Create anomaly indicator
df6['Anomaly'] = np.where(df6[var] > threshold, 1, 0)

# Count anomalies
anomaly_counts = df6['Anomaly'].value_counts()
anomaly_counts
Out[49]:
Anomaly
0    154911
1      8154
Name: count, dtype: int64

Simliar to Average medicare payments, this feature add additional level of granularity in detecting anomalies.

3.2.3 Per DRG by Zip Code ¶

In [50]:
table_7 = data.groupby(['DRG', 'Provider_Zipcode'])['Average_Total_Payments'].mean().reset_index()
table_7 = table_7.rename(columns={'Average_Total_Payments': 'Average_Total_Byzip'}).sort_values(by='Average_Total_Byzip', ascending=False)

table_7.head(10)
Out[50]:
DRG Provider_Zipcode Average_Total_Byzip
66990 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... 94305 156158.18
278041 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... 10595 140255.26
66996 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... 94538 119113.00
281094 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS 10595 119028.90
283753 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS 94305 109945.57
283585 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS 90095 103802.96
158974 329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC 10595 101796.26
204951 469 - MAJOR JOINT REPLACEMENT OR REATTACHMENT ... 17033 100018.33
64331 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... 10595 99929.46
159489 329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC 29220 99307.30
In [51]:
table_7_stats = calculate_descriptive_stats(table_7, 'Average_Total_Byzip')
table_7_stats
Out[51]:
Average_Total_Byzip
mean 9665.549998
min 2682.640000
25% 5215.000000
50% 7178.760000
75% 11233.000000
90% 17190.889000
95% 23786.226000
max 156158.180000
In [52]:
# Add average medicare payments by states to the dataset 
df7 = pd.merge(data, table_7, how='left', on=['DRG', 'Provider_Zipcode'])

# Calculate the ratio of the Average medicare payments to the Average medicare payments by states
df7['Average_Total_Byzip_Ratio'] = np.where(df7['Average_Total_Byzip']==0,0, df7['Average_Total_Payments'] / df7['Average_Total_Byzip'])

Distribution of Average Total Payments By Zip Code Ratio¶

In [53]:
# Define variables and quantile range
var = 'Average_Total_Byzip_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df7[binned_var] = pd.qcut(df7[var], percentile, duplicates='drop')

# Create the bin labels
df7['bin_label'] = df7[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df7['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.dense)

# Update the layout
fig.update_layout(
    title='Distribution of Average Total Payments By Zip Code Ratio',
    xaxis_title='Quantile Range',
    yaxis_title='Count',
    template = 'simple_white'
)

# Show the plot
fig.show()
In [54]:
# Calculate 95th percentile threshold
threshold = np.percentile(df7[var], 95)

# Create anomaly indicator
df7['Anomaly'] = np.where(df7[var] > threshold, 1, 0)

# Count anomalies
anomaly_counts = df7['Anomaly'].value_counts()
anomaly_counts
Out[54]:
Anomaly
0    155056
1      8009
Name: count, dtype: int64

Simliar to avg medicare payments, this feature add additional level of granularity.

This section of Average Total Payments focuses all costs associated with healthcare services, offering a broader perspective on anomalies related to overall service costs, resource, and patient payments.

3.3 Average Out of Pocket Payments ¶

For this section, benchmarks are calculated by using Total payments minus Medicare payments to get the Average Out-Of-Pocket Payments of Patients. Then group the DRGs, provider states, cities and zip code from the dataset, then take the mean of the Average OOP payments.

Three features are created under this section which are Average OOP Payments per DRG, by States, Cities & Zip Code. They can provides insights into healthcare expenditure patterns across various levels of granularity from the patient's perspective. Hospital that charges a higher than average OOP cost could considered anomalies.

3.3.1 Per DRG by States ¶

In [55]:
data['Average_OOP'] = data['Average_Total_Payments'] - data['Average_Medicare_Payments']
data.head()
Out[55]:
DRG Provider_Id Provider_Name Provider_StreetAddress Provider_City Provider_State Provider_Zipcode Hospital_referral_region_desp Total_Discharges Average_Covered_Charges Average_Total_Payments Average_Medicare_Payments Average_OOP
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 32963.07 5777.24 4763.73 1013.51
1 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10005 MARSHALL MEDICAL CENTER SOUTH 2505 U S HIGHWAY 431 NORTH BOAZ AL 35957 AL - Birmingham 14 15131.85 5787.57 4976.71 810.86
2 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10006 ELIZA COFFEE MEMORIAL HOSPITAL 205 MARENGO STREET FLORENCE AL 35631 AL - Birmingham 24 37560.37 5434.95 4453.79 981.16
3 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10011 ST VINCENT'S EAST 50 MEDICAL PARK EAST DRIVE BIRMINGHAM AL 35235 AL - Birmingham 25 13998.28 5417.56 4129.16 1288.40
4 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10016 SHELBY BAPTIST MEDICAL CENTER 1000 FIRST STREET NORTH ALABASTER AL 35007 AL - Birmingham 18 31633.27 5658.33 4851.44 806.89
In [56]:
table_8 = data.groupby(['DRG', 'Provider_State'])['Average_OOP'].mean().reset_index()
table_8 = table_8.rename(columns={'Average_OOP': 'Average_OOP_Bystates'})
table_8.head(10)
Out[56]:
DRG Provider_State Average_OOP_Bystates
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC AK 1988.170000
1 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC AL 1144.018696
2 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC AR 1180.116250
3 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC AZ 1358.722917
4 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC CA 1508.977761
5 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC CO 1631.992000
6 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC CT 1301.466000
7 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC DC 1383.156667
8 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC DE 1408.896667
9 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC FL 1334.597778
In [57]:
table_8_stats = calculate_descriptive_stats(table_8, 'Average_OOP_Bystates')
table_8_stats
Out[57]:
Average_OOP_Bystates
mean 1350.846689
min 359.710000
25% 957.941455
50% 1105.953333
75% 1424.579545
90% 2109.421204
95% 2687.367642
max 19964.515000
In [58]:
# Add average medicare payments by states to the dataset 
df8 = pd.merge(data, table_8, how='left', on=['DRG', 'Provider_State'])

# Calculate the ratio of the Average medicare payments to the Average medicare payments by states
df8['Average_OOP_Bystates_Ratio'] = np.where(df8['Average_OOP_Bystates']==0,0, df8['Average_OOP'] / df8['Average_OOP_Bystates'])

Distribution of Average Out-Of-Pocket Payments By State Ratio¶

In [59]:
# Define variables and quantile range
var = 'Average_OOP_Bystates_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df8[binned_var] = pd.qcut(df8[var], percentile)

# Create the bin labels
df8['bin_label'] = df8[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df8['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Purpor_r)

# Update the layout
fig.update_layout(
    title='Distribution of Average Out-Of-Pocket Payments By State Ratio',
    xaxis_title='Quantile Range',
    yaxis_title='Count',
    template = 'simple_white'
)

# Show the plot
fig.show()

This feature provides insights into the disparities in patient expenses across different states. Payments fall into the high range i.e. [1.91-3.20], [3.20-38.19] could be could be consider anomalies.

3.3.2 Per DRG by State & City ¶

In [60]:
table_9 = data.groupby(['DRG', 'Provider_State', 'Provider_City'])['Average_OOP'].mean().reset_index()
table_9 = table_9.rename(columns={'Average_OOP': 'Average_OOP_Bystatescity'}).sort_values(by='Average_OOP_Bystatescity', ascending=False)
table_9.head(10)
Out[60]:
DRG Provider_State Provider_City Average_OOP_Bystatescity
2916206 249 - PERC CARDIOVASC PROC W NON-DRUG-ELUTING ... WA BELLEVUE 75998.660
6748071 460 - SPINAL FUSION EXCEPT CERVICAL W/O MCC WA FEDERAL WAY 51457.650
3054632 252 - OTHER VASCULAR PROCEDURES W MCC IN BLOOMINGTON 41686.250
2054178 203 - BRONCHITIS & ASTHMA W/O CC/MCC MA ATTLEBORO 39691.910
2805011 247 - PERC CARDIOVASC PROC W DRUG-ELUTING STEN... SD SIOUX FALLS 37708.605
2682933 246 - PERC CARDIOVASC PROC W DRUG-ELUTING STEN... NJ BERLIN 37034.820
6723696 460 - SPINAL FUSION EXCEPT CERVICAL W/O MCC NY WEST ISLIP 35437.920
1286505 189 - PULMONARY EDEMA & RESPIRATORY FAILURE PA READING 33117.260
5329100 329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC TX MCKINNEY 32245.250
2551805 244 - PERMANENT CARDIAC PACEMAKER IMPLANT W/O ... IN RICHMOND 32146.450
In [61]:
table_9_stats = calculate_descriptive_stats(table_9, 'Average_OOP_Bystatescity')
table_9_stats
Out[61]:
Average_OOP_Bystatescity
mean 1183.422079
min 0.000000
25% 783.055000
50% 938.000000
75% 1204.592778
90% 1846.114000
95% 2579.456000
max 75998.660000
In [62]:
# Add average medicare payments by states to the dataset 
df9 = pd.merge(data, table_9, how='left', on=['DRG', 'Provider_State', 'Provider_City'])

# Calculate the ratio of the Average medicare payments to the Average medicare payments by states
df9['Average_OOP_Bystatescity_Ratio'] = np.where(df9['Average_OOP_Bystatescity']==0,0, df9['Average_OOP'] / df9['Average_OOP_Bystatescity'])

Distribution of Average Out-Of-Pocket Payments By State & City Ratio¶

In [63]:
# Define variables and quantile range
var = 'Average_OOP_Bystatescity_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df9[binned_var] = pd.qcut(df9[var], percentile, duplicates='drop')

# Create the bin labels
df9['bin_label'] = df9[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df9['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Purpor)

# Update the layout
fig.update_layout(
    title='Distribution of Average Out-Of-Pocket Payments By State & City Ratio',
    xaxis_title='Quantile Range',
    yaxis_title='Count',
    template = 'simple_white'
)

# Show the plot
fig.show()
In [64]:
# Calculate 95th percentile threshold
threshold = np.percentile(df9[var], 95)

# Create anomaly indicator
df9['Anomaly'] = np.where(df9[var] > threshold, 1, 0)

# Count anomalies
anomaly_counts = df9['Anomaly'].value_counts()
anomaly_counts
Out[64]:
Anomaly
0    154911
1      8154
Name: count, dtype: int64

Similar as the previous features, this adds city level granularity. Payments fall in the higher bin could be flagged as anomalies.

3.3.3 Per DRG by Zip Code ¶

In [65]:
table_10 = data.groupby(['DRG', 'Provider_Zipcode'])['Average_OOP'].mean().reset_index()
table_10 = table_10.rename(columns={'Average_OOP': 'Average_OOP_Byzip'}).sort_values(by='Average_OOP_Byzip', ascending=False)
table_10.head(10)
Out[65]:
DRG Provider_Zipcode Average_OOP_Byzip
88487 249 - PERC CARDIOVASC PROC W NON-DRUG-ELUTING ... 98004 75998.66
84205 247 - PERC CARDIOVASC PROC W DRUG-ELUTING STEN... 57108 74168.03
37116 189 - PULMONARY EDEMA & RESPIRATORY FAILURE 19605 65079.84
204500 460 - SPINAL FUSION EXCEPT CERVICAL W/O MCC 98003 51457.65
65978 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... 60657 41849.50
93126 252 - OTHER VASCULAR PROCEDURES W MCC 47403 41686.25
61115 203 - BRONCHITIS & ASTHMA W/O CC/MCC 2703 39691.91
79549 246 - PERC CARDIOVASC PROC W DRUG-ELUTING STEN... 8009 37034.82
201759 460 - SPINAL FUSION EXCEPT CERVICAL W/O MCC 11795 35437.92
90311 251 - PERC CARDIOVASC PROC W/O CORONARY ARTERY... 57108 34721.65
In [66]:
table_10_stats = calculate_descriptive_stats(table_10, 'Average_OOP_Byzip')
table_10_stats
Out[66]:
Average_OOP_Byzip
mean 1210.292981
min 0.000000
25% 778.770000
50% 938.000000
75% 1222.000000
90% 1923.914000
95% 2715.580000
max 75998.660000
In [67]:
# Add average medicare payments by states to the dataset 
df10 = pd.merge(data, table_10, how='left', on=['DRG', 'Provider_Zipcode'])

# Calculate the ratio of the Average medicare payments to the Average medicare payments by states
df10['Average_OOP_Byzip_Ratio'] = np.where(df10['Average_OOP_Byzip']==0,0, df10['Average_OOP'] / df10['Average_OOP_Byzip'])

Distribution of Average Out-Of-Pocket Payments By Zip Code Ratio¶

In [68]:
# Define variables and quantile range
var = 'Average_OOP_Byzip_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]

# Create the binned variable
df10[binned_var] = pd.qcut(df10[var], percentile, duplicates='drop')

# Create the bin labels
df10['bin_label'] = df10[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')

# Count the occurrences in each bin
bin_counts = df10['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']

# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
             color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Purpor)

# Update the layout
fig.update_layout(
    title='Distribution of Average Out-Of-Pocket Payments By Zip Code Ratio',
    xaxis_title='Quantile Range',
    yaxis_title='Count',
    template = 'simple_white'
)

# Show the plot
fig.show()
In [69]:
# Calculate 95th percentile threshold
threshold = np.percentile(df10[var], 95)

# Create anomaly indicator
df10['Anomaly'] = np.where(df10[var] > threshold, 1, 0)

# Count anomalies
anomaly_counts = df10['Anomaly'].value_counts()
anomaly_counts
Out[69]:
Anomaly
0    155070
1      7995
Name: count, dtype: int64

Similar as the previous features, this adds more level of granularity. Payments fall in the higher bin could be flagged as anomalies.

Data Preparation for Modeling¶

In [70]:
# Append new features back to orginal dataset for future anlaysis
# Make a copy of the original DataFrame
data_copy = data.copy()

# List of DataFrames
dfs = [df1, df2, df3, df4, df5, df6, df7, df8, df9, df10]

# Function to filter and merge only the 'Ratio' columns
def merge_ratio_columns(left, right):
    ratio_columns = [col for col in right.columns if col.endswith('Ratio')]
    return pd.merge(left, right[['DRG', 'Provider_Id', 'Provider_State', 'Provider_City', 'Provider_Zipcode'] + ratio_columns], 
                    on=['DRG', 'Provider_Id', 'Provider_State', 'Provider_City', 'Provider_Zipcode'], how='left')

# Use reduce to iteratively merge all DataFrames with the ratio columns into df_copy
final_df = reduce(merge_ratio_columns, dfs, data_copy)

# Display the final merged DataFrame
final_df.head()
Out[70]:
DRG Provider_Id Provider_Name Provider_StreetAddress Provider_City Provider_State Provider_Zipcode Hospital_referral_region_desp Total_Discharges Average_Covered_Charges ... Average_Medicare_byDRG_Ratio Average_Medicare_Bystates_Ratio Average_Medicare_Bystatescity_Ratio Average_Medicare_Byzip_Ratio Average_Total_Bystates_Ratio Average_Total_Bystatescity_Ratio Average_Total_Byzip_Ratio Average_OOP_Bystates_Ratio Average_OOP_Bystatescity_Ratio Average_OOP_Byzip_Ratio
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 32963.07 ... 0.857428 1.035685 1.038763 1.0 1.005855 1.037810 1.0 0.885921 1.033356 1.0
1 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10005 MARSHALL MEDICAL CENTER SOUTH 2505 U S HIGHWAY 431 NORTH BOAZ AL 35957 AL - Birmingham 14 15131.85 ... 0.895762 1.081989 1.000000 1.0 1.007653 1.000000 1.0 0.708782 1.000000 1.0
2 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10006 ELIZA COFFEE MEMORIAL HOSPITAL 205 MARENGO STREET FLORENCE AL 35631 AL - Birmingham 24 37560.37 ... 0.801642 0.968301 1.000000 1.0 0.946260 1.000000 1.0 0.857643 1.000000 1.0
3 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10011 ST VINCENT'S EAST 50 MEDICAL PARK EAST DRIVE BIRMINGHAM AL 35235 AL - Birmingham 25 13998.28 ... 0.743211 0.897723 0.899200 1.0 0.943232 0.915208 1.0 1.126205 0.970586 1.0
4 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10016 SHELBY BAPTIST MEDICAL CENTER 1000 FIRST STREET NORTH ALABASTER AL 35007 AL - Birmingham 18 31633.27 ... 0.873215 1.054754 1.000000 1.0 0.985152 1.000000 1.0 0.705312 1.000000 1.0

5 rows × 23 columns

In [71]:
# Extracting only the ratio columns
ratio_columns = [col for col in final_df.columns if col.endswith('Ratio')]
X = final_df[ratio_columns]

# Splitting data into train and test sets for X (features only)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=123)
In [72]:
X_train
Out[72]:
Average_Medicare_byDRG_Ratio Average_Medicare_Bystates_Ratio Average_Medicare_Bystatescity_Ratio Average_Medicare_Byzip_Ratio Average_Total_Bystates_Ratio Average_Total_Bystatescity_Ratio Average_Total_Byzip_Ratio Average_OOP_Bystates_Ratio Average_OOP_Bystatescity_Ratio Average_OOP_Byzip_Ratio
22044 0.950037 0.874802 1.000000 1.000000 0.897649 1.000000 1.000000 1.088349 1.000000 1.000000
85476 1.859740 1.463951 1.095857 1.000000 1.398264 1.074484 1.000000 0.483648 0.589728 1.000000
126489 0.937190 0.733368 0.927008 0.927008 0.838599 0.973378 0.973378 1.426914 1.136762 1.136762
93821 1.147512 1.333409 1.137058 1.000000 1.227748 1.110102 1.000000 0.659707 0.882715 1.000000
133143 0.754529 0.844225 1.000000 1.000000 0.876606 1.000000 1.000000 1.003489 1.000000 1.000000
... ... ... ... ... ... ... ... ... ... ...
146449 1.490217 1.595831 1.261310 1.000000 1.528690 1.246431 1.000000 1.118468 1.130201 1.000000
119906 1.336942 1.454418 1.263847 1.289831 1.246224 1.195356 1.219424 0.635439 0.876429 0.892346
17730 1.432104 1.433615 1.073340 1.000000 1.365176 1.062598 1.000000 0.606056 0.841591 1.000000
28030 1.658940 1.287742 1.000000 1.000000 1.266112 1.000000 1.000000 1.156397 1.000000 1.000000
15725 1.311969 1.048922 0.926972 1.000000 1.041244 0.925228 1.000000 0.902966 0.890192 1.000000

130452 rows × 10 columns

In [73]:
# Standardize data
X_train = StandardScaler().fit_transform(X_train)
X_train = pd.DataFrame(X_train)
X_test = StandardScaler().fit_transform(X_test)
X_test = pd.DataFrame(X_test)
In [74]:
X_train.shape
Out[74]:
(130452, 10)

Section 4: Autoencoder

Autoencoder is a type of neural network used for unsupervised learning of efficient codings. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for the purpose of dimensionality reduction or anomaly detection. It detect anomalies by measuring the reconstruction error. Data points with high reconstruction error are considered anomalies. It consists of two parts:

  • Encoder: Maps the input data to a lower-dimensional representation.
  • Decoder: Maps the encoded data back to the original data space.

Use Cases:

  • Anomaly Detection
  • Dimensinality Reduction
  • Denoising

4.1 Model 1 ¶

In [75]:
clf1 = AutoEncoder(contamination=0.05, hidden_neurons =[25, 2, 2, 25])
clf1.fit(X_train)
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape              ┃    Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ dense (Dense)                   │ (None, 10)                │        110 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dropout (Dropout)               │ (None, 10)                │          0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense_1 (Dense)                 │ (None, 10)                │        110 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dropout_1 (Dropout)             │ (None, 10)                │          0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense_2 (Dense)                 │ (None, 25)                │        275 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dropout_2 (Dropout)             │ (None, 25)                │          0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense_3 (Dense)                 │ (None, 2)                 │         52 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dropout_3 (Dropout)             │ (None, 2)                 │          0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense_4 (Dense)                 │ (None, 2)                 │          6 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dropout_4 (Dropout)             │ (None, 2)                 │          0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense_5 (Dense)                 │ (None, 25)                │         75 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dropout_5 (Dropout)             │ (None, 25)                │          0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense_6 (Dense)                 │ (None, 10)                │        260 │
└─────────────────────────────────┴───────────────────────────┴────────────┘
 Total params: 888 (3.47 KB)
 Trainable params: 888 (3.47 KB)
 Non-trainable params: 0 (0.00 B)
None
Epoch 1/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 331us/step - loss: 9.4096 - val_loss: 1.3466
Epoch 2/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 318us/step - loss: 1.2268 - val_loss: 1.0661
Epoch 3/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 318us/step - loss: 1.0227 - val_loss: 1.0231
Epoch 4/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 318us/step - loss: 0.9917 - val_loss: 1.0182
Epoch 5/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 321us/step - loss: 0.9987 - val_loss: 1.0176
Epoch 6/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 317us/step - loss: 0.9991 - val_loss: 1.0172
Epoch 7/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 641us/step - loss: 1.0018 - val_loss: 1.0172
Epoch 8/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 326us/step - loss: 0.9929 - val_loss: 1.0172
Epoch 9/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 321us/step - loss: 0.9940 - val_loss: 1.0172
Epoch 10/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 335us/step - loss: 1.0036 - val_loss: 1.0172
Epoch 11/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 359us/step - loss: 1.0009 - val_loss: 1.0172
Epoch 12/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 346us/step - loss: 0.9975 - val_loss: 1.0172
Epoch 13/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 350us/step - loss: 1.0038 - val_loss: 1.0172
Epoch 14/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 337us/step - loss: 0.9980 - val_loss: 1.0172
Epoch 15/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 597us/step - loss: 1.0119 - val_loss: 1.0172
Epoch 16/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 331us/step - loss: 1.0007 - val_loss: 1.0172
Epoch 17/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 330us/step - loss: 1.0023 - val_loss: 1.0172
Epoch 18/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 330us/step - loss: 1.0091 - val_loss: 1.0172
Epoch 19/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 327us/step - loss: 0.9884 - val_loss: 1.0172
Epoch 20/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 322us/step - loss: 1.0019 - val_loss: 1.0172
Epoch 21/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 322us/step - loss: 0.9907 - val_loss: 1.0172
Epoch 22/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 321us/step - loss: 0.9980 - val_loss: 1.0172
Epoch 23/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 320us/step - loss: 0.9965 - val_loss: 1.0172
Epoch 24/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 318us/step - loss: 0.9945 - val_loss: 1.0172
Epoch 25/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 320us/step - loss: 0.9993 - val_loss: 1.0172
Epoch 26/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 321us/step - loss: 1.0095 - val_loss: 1.0172
Epoch 27/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 323us/step - loss: 1.0021 - val_loss: 1.0172
Epoch 28/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 319us/step - loss: 1.0051 - val_loss: 1.0172
Epoch 29/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 318us/step - loss: 1.0006 - val_loss: 1.0172
Epoch 30/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 321us/step - loss: 1.0036 - val_loss: 1.0172
Epoch 31/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 320us/step - loss: 0.9978 - val_loss: 1.0172
Epoch 32/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 318us/step - loss: 0.9995 - val_loss: 1.0172
Epoch 33/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 327us/step - loss: 0.9957 - val_loss: 1.0172
Epoch 34/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 324us/step - loss: 1.0080 - val_loss: 1.0172
Epoch 35/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 336us/step - loss: 1.0028 - val_loss: 1.0172
Epoch 36/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 321us/step - loss: 1.0011 - val_loss: 1.0172
Epoch 37/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 321us/step - loss: 1.0028 - val_loss: 1.0172
Epoch 38/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 320us/step - loss: 0.9950 - val_loss: 1.0172
Epoch 39/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 318us/step - loss: 1.0163 - val_loss: 1.0172
Epoch 40/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 406us/step - loss: 0.9981 - val_loss: 1.0172
Epoch 41/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 350us/step - loss: 0.9921 - val_loss: 1.0172
Epoch 42/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 319us/step - loss: 0.9984 - val_loss: 1.0172
Epoch 43/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 325us/step - loss: 0.9915 - val_loss: 1.0172
Epoch 44/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 327us/step - loss: 0.9846 - val_loss: 1.0172
Epoch 45/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 320us/step - loss: 1.0127 - val_loss: 1.0172
Epoch 46/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 319us/step - loss: 0.9994 - val_loss: 1.0172
Epoch 47/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 318us/step - loss: 0.9902 - val_loss: 1.0172
Epoch 48/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 318us/step - loss: 1.0028 - val_loss: 1.0172
Epoch 49/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 317us/step - loss: 0.9923 - val_loss: 1.0172
Epoch 50/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 319us/step - loss: 0.9998 - val_loss: 1.0172
Epoch 51/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 323us/step - loss: 1.0002 - val_loss: 1.0172
Epoch 52/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 328us/step - loss: 1.0109 - val_loss: 1.0172
Epoch 53/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 320us/step - loss: 1.0099 - val_loss: 1.0172
Epoch 54/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 321us/step - loss: 1.0034 - val_loss: 1.0172
Epoch 55/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 325us/step - loss: 1.0007 - val_loss: 1.0172
Epoch 56/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 322us/step - loss: 1.0091 - val_loss: 1.0172
Epoch 57/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 321us/step - loss: 1.0037 - val_loss: 1.0172
Epoch 58/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 323us/step - loss: 0.9957 - val_loss: 1.0172
Epoch 59/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 322us/step - loss: 0.9833 - val_loss: 1.0172
Epoch 60/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 324us/step - loss: 0.9982 - val_loss: 1.0172
Epoch 61/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 323us/step - loss: 0.9951 - val_loss: 1.0172
Epoch 62/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 424us/step - loss: 1.0189 - val_loss: 1.0172
Epoch 63/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 325us/step - loss: 1.0119 - val_loss: 1.0172
Epoch 64/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 326us/step - loss: 0.9928 - val_loss: 1.0172
Epoch 65/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 325us/step - loss: 0.9972 - val_loss: 1.0172
Epoch 66/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 324us/step - loss: 1.0097 - val_loss: 1.0172
Epoch 67/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 357us/step - loss: 1.0071 - val_loss: 1.0172
Epoch 68/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 321us/step - loss: 1.0083 - val_loss: 1.0172
Epoch 69/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 322us/step - loss: 0.9977 - val_loss: 1.0172
Epoch 70/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 329us/step - loss: 1.0079 - val_loss: 1.0172
Epoch 71/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 331us/step - loss: 1.0026 - val_loss: 1.0172
Epoch 72/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 332us/step - loss: 0.9974 - val_loss: 1.0172
Epoch 73/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 339us/step - loss: 0.9897 - val_loss: 1.0172
Epoch 74/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 372us/step - loss: 0.9830 - val_loss: 1.0172
Epoch 75/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 330us/step - loss: 0.9991 - val_loss: 1.0172
Epoch 76/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 375us/step - loss: 0.9904 - val_loss: 1.0172
Epoch 77/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 355us/step - loss: 0.9804 - val_loss: 1.0172
Epoch 78/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 319us/step - loss: 0.9965 - val_loss: 1.0172
Epoch 79/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 318us/step - loss: 0.9942 - val_loss: 1.0172
Epoch 80/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 334us/step - loss: 1.0076 - val_loss: 1.0172
Epoch 81/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 537us/step - loss: 0.9975 - val_loss: 1.0172
Epoch 82/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 352us/step - loss: 1.0003 - val_loss: 1.0172
Epoch 83/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 389us/step - loss: 1.0101 - val_loss: 1.0172
Epoch 84/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 339us/step - loss: 0.9761 - val_loss: 1.0172
Epoch 85/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 399us/step - loss: 1.0048 - val_loss: 1.0172
Epoch 86/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 355us/step - loss: 0.9818 - val_loss: 1.0172
Epoch 87/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 356us/step - loss: 1.0224 - val_loss: 1.0172
Epoch 88/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 393us/step - loss: 1.0202 - val_loss: 1.0172
Epoch 89/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 325us/step - loss: 0.9974 - val_loss: 1.0172
Epoch 90/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 322us/step - loss: 1.0037 - val_loss: 1.0172
Epoch 91/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 338us/step - loss: 0.9960 - val_loss: 1.0172
Epoch 92/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 3s 886us/step - loss: 0.9916 - val_loss: 1.0172
Epoch 93/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 382us/step - loss: 1.0014 - val_loss: 1.0172
Epoch 94/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 447us/step - loss: 0.9919 - val_loss: 1.0172
Epoch 95/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 331us/step - loss: 1.0041 - val_loss: 1.0172
Epoch 96/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 328us/step - loss: 0.9991 - val_loss: 1.0172
Epoch 97/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 323us/step - loss: 1.0028 - val_loss: 1.0172
Epoch 98/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 322us/step - loss: 0.9938 - val_loss: 1.0172
Epoch 99/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 323us/step - loss: 0.9955 - val_loss: 1.0172
Epoch 100/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 323us/step - loss: 0.9972 - val_loss: 1.0172
4077/4077 ━━━━━━━━━━━━━━━━━━━━ 1s 159us/step
Out[75]:
AutoEncoder(batch_size=32, contamination=0.05, dropout_rate=0.2, epochs=100,
      hidden_activation='relu', hidden_neurons=[25, 2, 2, 25],
      l2_regularizer=0.1,
      loss=<function mean_squared_error at 0x16e35dbc0>, optimizer='adam',
      output_activation='sigmoid', preprocessing=True, random_state=None,
      validation_size=0.1, verbose=1)
In [76]:
# Get y train scores
y_train_scores = clf1.decision_scores_

threshold = clf1.threshold_

# get the prediction on the test data
y_test_pred = clf1.predict(X_test)  # outlier labels (0 or 1)

# Predict anomaly score of X using fitted detector
y_test_scores = clf1.decision_function(X_test)  # outlier scores

y_test_pred = pd.Series(y_test_pred)
y_test_scores = pd.Series(y_test_scores)
1020/1020 ━━━━━━━━━━━━━━━━━━━━ 0s 173us/step
1020/1020 ━━━━━━━━━━━━━━━━━━━━ 0s 160us/step
In [77]:
print(y_test_scores.describe(), '\n')
print("Threshold for Contamination Rate:", clf1.threshold_, '\n')
clf1.get_params()
count    32613.000000
mean         2.257349
std          2.214617
min          0.035631
25%          0.990732
50%          1.493886
75%          2.679246
max         55.760914
dtype: float64 

Threshold for Contamination Rate: 6.585086208915477 

Out[77]:
{'batch_size': 32,
 'contamination': 0.05,
 'dropout_rate': 0.2,
 'epochs': 100,
 'hidden_activation': 'relu',
 'hidden_neurons': [25, 2, 2, 25],
 'l2_regularizer': 0.1,
 'loss': <function keras.src.losses.losses.mean_squared_error(y_true, y_pred)>,
 'optimizer': 'adam',
 'output_activation': 'sigmoid',
 'preprocessing': True,
 'random_state': None,
 'validation_size': 0.1,
 'verbose': 1}
In [78]:
y_test_pred.value_counts()
Out[78]:
0    30997
1     1616
Name: count, dtype: int64
In [79]:
# Function to plot histograms with threshold annotation and color differentiation
def plot_histogram_with_threshold(scores, title, threshold):
    fig = go.Figure()

    # Split the scores into two parts: below the threshold and above/equal to the threshold
    below_threshold = scores[scores < threshold]
    above_threshold = scores[scores >= threshold]

    # Add histogram traces for the two parts
    fig.add_trace(go.Histogram(x=below_threshold, name='Normal', marker=dict(color='navy')))
    fig.add_trace(go.Histogram(x=above_threshold, name='Outliers', marker=dict(color='magenta')))
    
    # Add vertical line annotating the threshold
    fig.add_vline(x=threshold, line_dash="dash", line_color="#0d5087", annotation_text="Threshold",
                  annotation_position="top right", annotation=dict(font=dict(color="#27496d", size=12)))

    # Update layout
    fig.update_layout(
        title=title,
        xaxis_title="Anomaly Score",
        yaxis_title="Frequency",
        showlegend=True,
        height=450,
        width=800,
        bargroupgap=0.1,
        bargap=0.1,
        paper_bgcolor='rgba(0,0,0,0)',
        plot_bgcolor='rgba(0,0,0,0)'
    )

    fig.update_xaxes(range=[-0.1, 15])

    # Show the plot
    fig.show()

# Plotting the combination by average for test data
plot_histogram_with_threshold(np.array(y_test_scores), "Anomaly Scores for Clf1 Model", threshold)

Model 1 Stats¶

In [80]:
def descriptive_stat_threshold(df,pred_score, threshold):
    # Let's see how many '0's and '1's.
    df = pd.DataFrame(df)
    df['Anomaly_Score'] = pred_score
    df['Group'] = np.where(df['Anomaly_Score']< threshold, 'Normal', 'Outlier')

    # Now let's show the summary statistics:
    cnt = df.groupby('Group')['Anomaly_Score'].count().reset_index().rename(columns={'Anomaly_Score':'Count'})
    cnt['Count %'] = (cnt['Count'] / cnt['Count'].sum()) * 100 # The count and count %
    stat = df.groupby('Group').mean().round(4).reset_index() # The avg.
    stat = cnt.merge(stat, left_on='Group',right_on='Group') # Put the count and the avg. together
    return (stat)
In [81]:
model_1 = descriptive_stat_threshold(X_test,y_test_scores, threshold)
model_1
Out[81]:
Group Count Count % 0 1 2 3 4 5 6 7 8 9 Anomaly_Score
0 Normal 30997 95.044921 -0.0703 -0.076 -0.0625 -0.0079 -0.0930 -0.0741 -0.0113 -0.0572 -0.0390 -0.0040 1.8847
1 Outlier 1616 4.955079 1.3489 1.458 1.1988 0.1511 1.7835 1.4205 0.2177 1.0980 0.7479 0.0776 9.4060

4.2 Model 2 ¶

In [82]:
clf2 = AutoEncoder(contamination=0.05, hidden_neurons =[25, 10, 2, 10, 25])
clf2.fit(X_train)
Model: "sequential_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape              ┃    Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ dense_7 (Dense)                 │ (None, 10)                │        110 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dropout_6 (Dropout)             │ (None, 10)                │          0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense_8 (Dense)                 │ (None, 10)                │        110 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dropout_7 (Dropout)             │ (None, 10)                │          0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense_9 (Dense)                 │ (None, 25)                │        275 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dropout_8 (Dropout)             │ (None, 25)                │          0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense_10 (Dense)                │ (None, 10)                │        260 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dropout_9 (Dropout)             │ (None, 10)                │          0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense_11 (Dense)                │ (None, 2)                 │         22 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dropout_10 (Dropout)            │ (None, 2)                 │          0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense_12 (Dense)                │ (None, 10)                │         30 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dropout_11 (Dropout)            │ (None, 10)                │          0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense_13 (Dense)                │ (None, 25)                │        275 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dropout_12 (Dropout)            │ (None, 25)                │          0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense_14 (Dense)                │ (None, 10)                │        260 │
└─────────────────────────────────┴───────────────────────────┴────────────┘
 Total params: 1,342 (5.24 KB)
 Trainable params: 1,342 (5.24 KB)
 Non-trainable params: 0 (0.00 B)
None
Epoch 1/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 3s 402us/step - loss: 7.7795 - val_loss: 1.3388
Epoch 2/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 366us/step - loss: 1.2038 - val_loss: 1.0723
Epoch 3/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 368us/step - loss: 1.0422 - val_loss: 1.0297
Epoch 4/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 372us/step - loss: 1.0156 - val_loss: 1.0248
Epoch 5/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 354us/step - loss: 0.9989 - val_loss: 1.0239
Epoch 6/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 351us/step - loss: 0.9979 - val_loss: 1.0237
Epoch 7/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 365us/step - loss: 0.9875 - val_loss: 1.0237
Epoch 8/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 388us/step - loss: 0.9978 - val_loss: 1.0237
Epoch 9/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 363us/step - loss: 0.9980 - val_loss: 1.0239
Epoch 10/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 370us/step - loss: 0.9968 - val_loss: 1.0236
Epoch 11/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 352us/step - loss: 1.0054 - val_loss: 1.0236
Epoch 12/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 362us/step - loss: 0.9961 - val_loss: 1.0236
Epoch 13/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 357us/step - loss: 1.0104 - val_loss: 1.0236
Epoch 14/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 357us/step - loss: 0.9977 - val_loss: 1.0236
Epoch 15/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 358us/step - loss: 0.9966 - val_loss: 1.0236
Epoch 16/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 377us/step - loss: 0.9947 - val_loss: 1.0236
Epoch 17/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 354us/step - loss: 1.0099 - val_loss: 1.0236
Epoch 18/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 361us/step - loss: 0.9948 - val_loss: 1.0236
Epoch 19/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 352us/step - loss: 0.9891 - val_loss: 1.0236
Epoch 20/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 351us/step - loss: 1.0029 - val_loss: 1.0236
Epoch 21/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 347us/step - loss: 1.0241 - val_loss: 1.0236
Epoch 22/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 348us/step - loss: 1.0163 - val_loss: 1.0236
Epoch 23/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 353us/step - loss: 0.9938 - val_loss: 1.0236
Epoch 24/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 355us/step - loss: 1.0016 - val_loss: 1.0236
Epoch 25/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 355us/step - loss: 0.9929 - val_loss: 1.0236
Epoch 26/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 362us/step - loss: 0.9988 - val_loss: 1.0236
Epoch 27/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 353us/step - loss: 0.9953 - val_loss: 1.0236
Epoch 28/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 444us/step - loss: 1.0125 - val_loss: 1.0236
Epoch 29/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 373us/step - loss: 1.0145 - val_loss: 1.0236
Epoch 30/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 368us/step - loss: 0.9976 - val_loss: 1.0236
Epoch 31/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 355us/step - loss: 1.0040 - val_loss: 1.0236
Epoch 32/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 351us/step - loss: 0.9900 - val_loss: 1.0236
Epoch 33/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 345us/step - loss: 0.9995 - val_loss: 1.0236
Epoch 34/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 352us/step - loss: 1.0074 - val_loss: 1.0236
Epoch 35/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 363us/step - loss: 1.0093 - val_loss: 1.0236
Epoch 36/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 362us/step - loss: 1.0168 - val_loss: 1.0236
Epoch 37/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 357us/step - loss: 0.9934 - val_loss: 1.0236
Epoch 38/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 394us/step - loss: 0.9876 - val_loss: 1.0236
Epoch 39/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 374us/step - loss: 0.9973 - val_loss: 1.0236
Epoch 40/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 373us/step - loss: 1.0041 - val_loss: 1.0236
Epoch 41/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 368us/step - loss: 0.9988 - val_loss: 1.0236
Epoch 42/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 356us/step - loss: 0.9813 - val_loss: 1.0236
Epoch 43/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 355us/step - loss: 1.0024 - val_loss: 1.0236
Epoch 44/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 352us/step - loss: 0.9981 - val_loss: 1.0236
Epoch 45/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 353us/step - loss: 0.9961 - val_loss: 1.0236
Epoch 46/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 351us/step - loss: 0.9931 - val_loss: 1.0236
Epoch 47/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 356us/step - loss: 0.9947 - val_loss: 1.0236
Epoch 48/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 356us/step - loss: 1.0011 - val_loss: 1.0236
Epoch 49/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 352us/step - loss: 1.0044 - val_loss: 1.0236
Epoch 50/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 353us/step - loss: 0.9825 - val_loss: 1.0236
Epoch 51/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 354us/step - loss: 1.0033 - val_loss: 1.0236
Epoch 52/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 353us/step - loss: 1.0066 - val_loss: 1.0236
Epoch 53/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 357us/step - loss: 1.0043 - val_loss: 1.0236
Epoch 54/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 354us/step - loss: 1.0005 - val_loss: 1.0236
Epoch 55/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 355us/step - loss: 1.0041 - val_loss: 1.0236
Epoch 56/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 355us/step - loss: 1.0079 - val_loss: 1.0236
Epoch 57/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 355us/step - loss: 1.0213 - val_loss: 1.0236
Epoch 58/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 358us/step - loss: 1.0049 - val_loss: 1.0236
Epoch 59/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 354us/step - loss: 1.0110 - val_loss: 1.0236
Epoch 60/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 361us/step - loss: 0.9978 - val_loss: 1.0236
Epoch 61/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 353us/step - loss: 1.0106 - val_loss: 1.0236
Epoch 62/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 365us/step - loss: 1.0068 - val_loss: 1.0236
Epoch 63/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 482us/step - loss: 0.9943 - val_loss: 1.0236
Epoch 64/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 354us/step - loss: 1.0123 - val_loss: 1.0236
Epoch 65/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 355us/step - loss: 1.0084 - val_loss: 1.0236
Epoch 66/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 351us/step - loss: 1.0025 - val_loss: 1.0236
Epoch 67/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 357us/step - loss: 0.9980 - val_loss: 1.0236
Epoch 68/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 367us/step - loss: 0.9887 - val_loss: 1.0236
Epoch 69/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 352us/step - loss: 1.0057 - val_loss: 1.0236
Epoch 70/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 355us/step - loss: 1.0071 - val_loss: 1.0236
Epoch 71/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 405us/step - loss: 0.9925 - val_loss: 1.0236
Epoch 72/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 357us/step - loss: 0.9862 - val_loss: 1.0236
Epoch 73/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 425us/step - loss: 0.9981 - val_loss: 1.0236
Epoch 74/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 362us/step - loss: 1.0103 - val_loss: 1.0236
Epoch 75/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 389us/step - loss: 0.9842 - val_loss: 1.0236
Epoch 76/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 357us/step - loss: 1.0198 - val_loss: 1.0236
Epoch 77/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 410us/step - loss: 0.9958 - val_loss: 1.0236
Epoch 78/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 363us/step - loss: 0.9944 - val_loss: 1.0236
Epoch 79/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 352us/step - loss: 0.9936 - val_loss: 1.0236
Epoch 80/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 361us/step - loss: 0.9860 - val_loss: 1.0236
Epoch 81/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 357us/step - loss: 0.9970 - val_loss: 1.0236
Epoch 82/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 360us/step - loss: 1.0146 - val_loss: 1.0236
Epoch 83/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 354us/step - loss: 0.9852 - val_loss: 1.0236
Epoch 84/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 356us/step - loss: 0.9896 - val_loss: 1.0236
Epoch 85/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 355us/step - loss: 1.0000 - val_loss: 1.0236
Epoch 86/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 352us/step - loss: 1.0012 - val_loss: 1.0236
Epoch 87/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 357us/step - loss: 1.0114 - val_loss: 1.0236
Epoch 88/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 358us/step - loss: 0.9845 - val_loss: 1.0236
Epoch 89/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 367us/step - loss: 1.0198 - val_loss: 1.0236
Epoch 90/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 352us/step - loss: 0.9945 - val_loss: 1.0236
Epoch 91/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 353us/step - loss: 0.9843 - val_loss: 1.0236
Epoch 92/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 358us/step - loss: 1.0006 - val_loss: 1.0236
Epoch 93/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 352us/step - loss: 0.9787 - val_loss: 1.0236
Epoch 94/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 350us/step - loss: 0.9935 - val_loss: 1.0236
Epoch 95/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 359us/step - loss: 0.9970 - val_loss: 1.0236
Epoch 96/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 3s 714us/step - loss: 1.0008 - val_loss: 1.0236
Epoch 97/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 373us/step - loss: 0.9962 - val_loss: 1.0236
Epoch 98/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 461us/step - loss: 1.0045 - val_loss: 1.0236
Epoch 99/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 501us/step - loss: 0.9937 - val_loss: 1.0236
Epoch 100/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 376us/step - loss: 1.0152 - val_loss: 1.0236
4077/4077 ━━━━━━━━━━━━━━━━━━━━ 3s 652us/step
Out[82]:
AutoEncoder(batch_size=32, contamination=0.05, dropout_rate=0.2, epochs=100,
      hidden_activation='relu', hidden_neurons=[25, 10, 2, 10, 25],
      l2_regularizer=0.1,
      loss=<function mean_squared_error at 0x16e35dbc0>, optimizer='adam',
      output_activation='sigmoid', preprocessing=True, random_state=None,
      validation_size=0.1, verbose=1)
In [83]:
# Get y train scores
y_train_scores2 = clf2.decision_scores_

threshold2 = clf2.threshold_

# get the prediction on the test data
y_test_pred2 = clf2.predict(X_test)  # outlier labels (0 or 1)

# Predict anomaly score of X using fitted detector
y_test_scores2 = clf2.decision_function(X_test)  # outlier scores

y_test_pred2 = pd.Series(y_test_pred2)
y_test_scores2 = pd.Series(y_test_scores2)
1020/1020 ━━━━━━━━━━━━━━━━━━━━ 0s 283us/step
1020/1020 ━━━━━━━━━━━━━━━━━━━━ 0s 171us/step
In [84]:
print(y_test_scores2.describe(), '\n')
print("Threshold for Contamination Rate:", clf2.threshold_, '\n')
clf2.get_params()
count    32613.000000
mean         2.257347
std          2.214619
min          0.035605
25%          0.990695
50%          1.493880
75%          2.679262
max         55.760895
dtype: float64 

Threshold for Contamination Rate: 6.585046138271961 

Out[84]:
{'batch_size': 32,
 'contamination': 0.05,
 'dropout_rate': 0.2,
 'epochs': 100,
 'hidden_activation': 'relu',
 'hidden_neurons': [25, 10, 2, 10, 25],
 'l2_regularizer': 0.1,
 'loss': <function keras.src.losses.losses.mean_squared_error(y_true, y_pred)>,
 'optimizer': 'adam',
 'output_activation': 'sigmoid',
 'preprocessing': True,
 'random_state': None,
 'validation_size': 0.1,
 'verbose': 1}
In [85]:
plot_histogram_with_threshold(np.array(y_test_scores2), "Anomaly Scores for Clf2 Model", threshold2)

Model 2 Stats¶

In [86]:
model_2 = descriptive_stat_threshold(X_test,y_test_scores2, threshold2)
model_2
Out[86]:
Group Count Count % 0 1 2 3 4 5 6 7 8 9 Anomaly_Score
0 Normal 30997 95.044921 -0.0703 -0.076 -0.0625 -0.0079 -0.0930 -0.0741 -0.0113 -0.0572 -0.0390 -0.0040 1.8847
1 Outlier 1616 4.955079 1.3489 1.458 1.1988 0.1511 1.7835 1.4205 0.2177 1.0980 0.7479 0.0776 9.4060

4.3 Model 3 ¶

In [87]:
clf3 = AutoEncoder(contamination=0.05, hidden_neurons =[25, 15, 10, 2, 10, 15, 25])
clf3.fit(X_train)
Model: "sequential_2"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape              ┃    Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ dense_15 (Dense)                │ (None, 10)                │        110 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dropout_13 (Dropout)            │ (None, 10)                │          0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense_16 (Dense)                │ (None, 10)                │        110 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dropout_14 (Dropout)            │ (None, 10)                │          0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense_17 (Dense)                │ (None, 25)                │        275 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dropout_15 (Dropout)            │ (None, 25)                │          0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense_18 (Dense)                │ (None, 15)                │        390 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dropout_16 (Dropout)            │ (None, 15)                │          0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense_19 (Dense)                │ (None, 10)                │        160 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dropout_17 (Dropout)            │ (None, 10)                │          0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense_20 (Dense)                │ (None, 2)                 │         22 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dropout_18 (Dropout)            │ (None, 2)                 │          0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense_21 (Dense)                │ (None, 10)                │         30 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dropout_19 (Dropout)            │ (None, 10)                │          0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense_22 (Dense)                │ (None, 15)                │        165 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dropout_20 (Dropout)            │ (None, 15)                │          0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense_23 (Dense)                │ (None, 25)                │        400 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dropout_21 (Dropout)            │ (None, 25)                │          0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense_24 (Dense)                │ (None, 10)                │        260 │
└─────────────────────────────────┴───────────────────────────┴────────────┘
 Total params: 1,922 (7.51 KB)
 Trainable params: 1,922 (7.51 KB)
 Non-trainable params: 0 (0.00 B)
None
Epoch 1/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 422us/step - loss: 8.7458 - val_loss: 1.3073
Epoch 2/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 404us/step - loss: 1.2296 - val_loss: 1.0455
Epoch 3/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 463us/step - loss: 1.0214 - val_loss: 1.0036
Epoch 4/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 412us/step - loss: 1.0065 - val_loss: 0.9988
Epoch 5/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 405us/step - loss: 0.9820 - val_loss: 0.9981
Epoch 6/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 408us/step - loss: 0.9898 - val_loss: 0.9980
Epoch 7/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 406us/step - loss: 1.0113 - val_loss: 0.9980
Epoch 8/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 402us/step - loss: 0.9922 - val_loss: 0.9980
Epoch 9/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 407us/step - loss: 0.9925 - val_loss: 0.9980
Epoch 10/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 535us/step - loss: 0.9955 - val_loss: 0.9979
Epoch 11/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 425us/step - loss: 0.9814 - val_loss: 0.9979
Epoch 12/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 418us/step - loss: 0.9953 - val_loss: 0.9979
Epoch 13/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 415us/step - loss: 0.9804 - val_loss: 0.9979
Epoch 14/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 400us/step - loss: 1.0039 - val_loss: 0.9979
Epoch 15/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 405us/step - loss: 1.0154 - val_loss: 0.9979
Epoch 16/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 410us/step - loss: 0.9973 - val_loss: 0.9979
Epoch 17/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 405us/step - loss: 1.0224 - val_loss: 0.9979
Epoch 18/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 407us/step - loss: 1.0069 - val_loss: 0.9979
Epoch 19/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 407us/step - loss: 0.9886 - val_loss: 0.9979
Epoch 20/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 403us/step - loss: 1.0006 - val_loss: 0.9979
Epoch 21/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 401us/step - loss: 0.9965 - val_loss: 0.9979
Epoch 22/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 410us/step - loss: 1.0030 - val_loss: 0.9979
Epoch 23/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 475us/step - loss: 1.0041 - val_loss: 0.9979
Epoch 24/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 421us/step - loss: 0.9931 - val_loss: 0.9979
Epoch 25/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 527us/step - loss: 1.0115 - val_loss: 0.9979
Epoch 26/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 428us/step - loss: 1.0209 - val_loss: 0.9979
Epoch 27/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 407us/step - loss: 0.9817 - val_loss: 0.9979
Epoch 28/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 405us/step - loss: 1.0061 - val_loss: 0.9979
Epoch 29/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 405us/step - loss: 0.9922 - val_loss: 0.9979
Epoch 30/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 404us/step - loss: 0.9945 - val_loss: 0.9979
Epoch 31/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 405us/step - loss: 1.0085 - val_loss: 0.9979
Epoch 32/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 439us/step - loss: 1.0008 - val_loss: 0.9979
Epoch 33/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 418us/step - loss: 1.0013 - val_loss: 0.9979
Epoch 34/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 413us/step - loss: 1.0056 - val_loss: 0.9979
Epoch 35/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 408us/step - loss: 1.0070 - val_loss: 0.9979
Epoch 36/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 407us/step - loss: 1.0053 - val_loss: 0.9979
Epoch 37/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 407us/step - loss: 1.0043 - val_loss: 0.9979
Epoch 38/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 405us/step - loss: 0.9988 - val_loss: 0.9979
Epoch 39/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 405us/step - loss: 1.0040 - val_loss: 0.9979
Epoch 40/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 402us/step - loss: 1.0058 - val_loss: 0.9979
Epoch 41/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 404us/step - loss: 0.9919 - val_loss: 0.9979
Epoch 42/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 413us/step - loss: 1.0083 - val_loss: 0.9979
Epoch 43/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 612us/step - loss: 1.0175 - val_loss: 0.9979
Epoch 44/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 3s 682us/step - loss: 0.9834 - val_loss: 0.9979
Epoch 45/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 452us/step - loss: 0.9951 - val_loss: 0.9979
Epoch 46/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 458us/step - loss: 0.9848 - val_loss: 0.9979
Epoch 47/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 421us/step - loss: 1.0215 - val_loss: 0.9979
Epoch 48/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 506us/step - loss: 1.0114 - val_loss: 0.9979
Epoch 49/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 407us/step - loss: 1.0051 - val_loss: 0.9979
Epoch 50/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 403us/step - loss: 1.0061 - val_loss: 0.9979
Epoch 51/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 408us/step - loss: 1.0102 - val_loss: 0.9979
Epoch 52/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 400us/step - loss: 1.0008 - val_loss: 0.9979
Epoch 53/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 399us/step - loss: 1.0064 - val_loss: 0.9979
Epoch 54/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 437us/step - loss: 0.9945 - val_loss: 0.9979
Epoch 55/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 398us/step - loss: 0.9881 - val_loss: 0.9979
Epoch 56/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 400us/step - loss: 1.0059 - val_loss: 0.9979
Epoch 57/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 400us/step - loss: 1.0082 - val_loss: 0.9979
Epoch 58/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 399us/step - loss: 1.0085 - val_loss: 0.9979
Epoch 59/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 398us/step - loss: 0.9843 - val_loss: 0.9979
Epoch 60/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 400us/step - loss: 1.0111 - val_loss: 0.9979
Epoch 61/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 399us/step - loss: 1.0046 - val_loss: 0.9979
Epoch 62/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 408us/step - loss: 1.0008 - val_loss: 0.9979
Epoch 63/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 399us/step - loss: 0.9842 - val_loss: 0.9979
Epoch 64/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 404us/step - loss: 0.9973 - val_loss: 0.9979
Epoch 65/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 400us/step - loss: 0.9965 - val_loss: 0.9979
Epoch 66/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 3s 717us/step - loss: 0.9918 - val_loss: 0.9979
Epoch 67/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 429us/step - loss: 1.0023 - val_loss: 0.9979
Epoch 68/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 413us/step - loss: 0.9989 - val_loss: 0.9979
Epoch 69/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 461us/step - loss: 0.9998 - val_loss: 0.9979
Epoch 70/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 3s 910us/step - loss: 0.9970 - val_loss: 0.9979
Epoch 71/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 565us/step - loss: 1.0104 - val_loss: 0.9979
Epoch 72/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 478us/step - loss: 0.9850 - val_loss: 0.9979
Epoch 73/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 434us/step - loss: 1.0053 - val_loss: 0.9979
Epoch 74/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 436us/step - loss: 0.9883 - val_loss: 0.9979
Epoch 75/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 461us/step - loss: 1.0017 - val_loss: 0.9979
Epoch 76/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 447us/step - loss: 1.0210 - val_loss: 0.9979
Epoch 77/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 437us/step - loss: 1.0006 - val_loss: 0.9979
Epoch 78/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 447us/step - loss: 1.0070 - val_loss: 0.9979
Epoch 79/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 401us/step - loss: 1.0038 - val_loss: 0.9979
Epoch 80/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 464us/step - loss: 0.9899 - val_loss: 0.9979
Epoch 81/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 488us/step - loss: 1.0153 - val_loss: 0.9979
Epoch 82/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 453us/step - loss: 1.0067 - val_loss: 0.9979
Epoch 83/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 436us/step - loss: 1.0045 - val_loss: 0.9979
Epoch 84/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 540us/step - loss: 1.0186 - val_loss: 0.9979
Epoch 85/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 461us/step - loss: 0.9907 - val_loss: 0.9979
Epoch 86/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 497us/step - loss: 1.0027 - val_loss: 0.9979
Epoch 87/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 462us/step - loss: 0.9925 - val_loss: 0.9979
Epoch 88/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 418us/step - loss: 1.0269 - val_loss: 0.9979
Epoch 89/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 450us/step - loss: 1.0073 - val_loss: 0.9979
Epoch 90/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 442us/step - loss: 0.9823 - val_loss: 0.9979
Epoch 91/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 427us/step - loss: 1.0087 - val_loss: 0.9979
Epoch 92/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 430us/step - loss: 1.0061 - val_loss: 0.9979
Epoch 93/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 445us/step - loss: 0.9937 - val_loss: 0.9979
Epoch 94/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 437us/step - loss: 1.0031 - val_loss: 0.9979
Epoch 95/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 445us/step - loss: 1.0143 - val_loss: 0.9979
Epoch 96/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 427us/step - loss: 1.0064 - val_loss: 0.9979
Epoch 97/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 424us/step - loss: 0.9956 - val_loss: 0.9979
Epoch 98/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 425us/step - loss: 0.9956 - val_loss: 0.9979
Epoch 99/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 419us/step - loss: 0.9847 - val_loss: 0.9979
Epoch 100/100
3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 421us/step - loss: 1.0009 - val_loss: 0.9979
4077/4077 ━━━━━━━━━━━━━━━━━━━━ 1s 182us/step
Out[87]:
AutoEncoder(batch_size=32, contamination=0.05, dropout_rate=0.2, epochs=100,
      hidden_activation='relu', hidden_neurons=[25, 15, 10, 2, 10, 15, 25],
      l2_regularizer=0.1,
      loss=<function mean_squared_error at 0x16e35dbc0>, optimizer='adam',
      output_activation='sigmoid', preprocessing=True, random_state=None,
      validation_size=0.1, verbose=1)
In [88]:
# Get y train scores
y_train_scores3 = clf3.decision_scores_

threshold3 = clf3.threshold_

# get the prediction on the test data
y_test_pred3 = clf3.predict(X_test)  # outlier labels (0 or 1)

# Predict anomaly score of X using fitted detector
y_test_scores3 = clf3.decision_function(X_test)  # outlier scores

y_test_pred3 = pd.Series(y_test_pred3)
y_test_scores3 = pd.Series(y_test_scores3)
1020/1020 ━━━━━━━━━━━━━━━━━━━━ 0s 218us/step
1020/1020 ━━━━━━━━━━━━━━━━━━━━ 0s 166us/step
In [89]:
print(y_test_scores3.describe(), '\n')
print("Threshold for Contamination Rate:", clf3.threshold_, '\n')
clf3.get_params()
count    32613.000000
mean         2.257346
std          2.214620
min          0.035614
25%          0.990717
50%          1.493887
75%          2.679276
max         55.760933
dtype: float64 

Threshold for Contamination Rate: 6.585096522818382 

Out[89]:
{'batch_size': 32,
 'contamination': 0.05,
 'dropout_rate': 0.2,
 'epochs': 100,
 'hidden_activation': 'relu',
 'hidden_neurons': [25, 15, 10, 2, 10, 15, 25],
 'l2_regularizer': 0.1,
 'loss': <function keras.src.losses.losses.mean_squared_error(y_true, y_pred)>,
 'optimizer': 'adam',
 'output_activation': 'sigmoid',
 'preprocessing': True,
 'random_state': None,
 'validation_size': 0.1,
 'verbose': 1}
In [90]:
plot_histogram_with_threshold(np.array(y_test_scores3), "Anomaly Scores for Clf3 Model", clf3.threshold_)

Model 3 Stats¶

In [91]:
model_3 = descriptive_stat_threshold(X_test,y_test_scores3, threshold3)
model_3
Out[91]:
Group Count Count % 0 1 2 3 4 5 6 7 8 9 Anomaly_Score
0 Normal 30997 95.044921 -0.0703 -0.076 -0.0625 -0.0079 -0.0930 -0.0741 -0.0113 -0.0572 -0.0390 -0.0040 1.8847
1 Outlier 1616 4.955079 1.3489 1.458 1.1988 0.1511 1.7835 1.4205 0.2177 1.0980 0.7479 0.0776 9.4060

4.4 Aggregate Multiple Models ¶

In [92]:
# Put all the predictions in a data frame
train_scores = pd.DataFrame({'clf1': clf1.decision_scores_,
                             'clf2': clf2.decision_scores_,
                             'clf3': clf3.decision_scores_
                            })

test_scores  = pd.DataFrame({'clf1': clf1.decision_function(X_test),
                             'clf2': clf2.decision_function(X_test),
                             'clf3': clf3.decision_function(X_test) 
                            })
1020/1020 ━━━━━━━━━━━━━━━━━━━━ 0s 397us/step
1020/1020 ━━━━━━━━━━━━━━━━━━━━ 0s 180us/step
1020/1020 ━━━━━━━━━━━━━━━━━━━━ 0s 166us/step
In [93]:
train_scores.head()
Out[93]:
clf1 clf2 clf3
0 0.816857 0.816857 0.816852
1 4.866587 4.866563 4.866580
2 3.215261 3.215289 3.215277
3 2.800463 2.800432 2.800467
4 1.299050 1.299050 1.299045
In [94]:
test_scores.head()
Out[94]:
clf1 clf2 clf3
0 1.261936 1.261924 1.261927
1 1.462396 1.462378 1.462387
2 0.892089 0.892093 0.892095
3 0.953255 0.953249 0.953248
4 0.921061 0.921044 0.921054

4.5 Combination by Average ¶

In [95]:
# Standardize the decision scores
train_scores_norm, test_scores_norm = standardizer(train_scores,test_scores)
In [96]:
# Combination by average
y_train_by_average = average(train_scores_norm)
y_test_by_average = average(test_scores_norm)
y_test_by_average[1:10]
Out[96]:
array([-0.35827573, -0.61566098, -0.588059  , -0.60259031, -0.77464724,
       -0.57074008, -0.53622878,  0.75240693, -0.16169526])
In [97]:
contamination = 0.05
threshold_index = int((1 - contamination) * len(y_train_by_average))
threshold_avg = np.sort(y_train_by_average)[threshold_index]
threshold_avg
Out[97]:
1.9541912732491085
In [98]:
# Function to plot histograms with threshold annotation and color differentiation
def plot_histogram_with_threshold_new(scores, title, threshold):
    fig = go.Figure()

    # Split the scores into two parts: below the threshold and above/equal to the threshold
    below_threshold = scores[scores < threshold]
    above_threshold = scores[scores >= threshold]

    # Add histogram traces for the two parts
    fig.add_trace(go.Histogram(x=below_threshold, name='Normal', marker=dict(color='navy')))
    fig.add_trace(go.Histogram(x=above_threshold, name='Outliers', marker=dict(color='Magenta')))
    
    # Add vertical line annotating the threshold
    fig.add_vline(x=threshold, line_dash="dash", line_color="#0d5087", annotation_text="Threshold",
                  annotation_position="top right", annotation=dict(font=dict(color="#27496d", size=12)))

    # Update layout
    fig.update_layout(
        title=title,
        xaxis_title="Anomaly Score",
        yaxis_title="Frequency",
        showlegend=True,
        height=450,
        width=800,
        paper_bgcolor='rgba(0,0,0,0)',
        plot_bgcolor='rgba(0,0,0,0)'
    )

    fig.update_xaxes(range=[-0.1, 10])

    # Show the plot
    fig.show()
In [99]:
plot_histogram_with_threshold_new(np.array(y_train_by_average), "Anomaly Scores for Combination by Average", threshold_avg)

Combination by Avg Stats¶

In [100]:
auto_avg=descriptive_stat_threshold(X_test,y_test_by_average, threshold_avg)
auto_avg
Out[100]:
Group Count Count % 0 1 2 3 4 5 6 7 8 9 Anomaly_Score
0 Normal 30997 95.044921 -0.0703 -0.076 -0.0625 -0.0079 -0.0930 -0.0741 -0.0113 -0.0572 -0.0390 -0.0040 -0.1677
1 Outlier 1616 4.955079 1.3489 1.458 1.1988 0.1511 1.7835 1.4205 0.2177 1.0980 0.7479 0.0776 3.2268

4.5.1 Combination by AVG Outlier List ¶

In [101]:
# Add a column 'Prediction_Comb_Avg' to final_df and initialize with 0 (not outlier)
final_df['Prediction_Comb_Avg'] = 0

# Mark outliers in training data
final_df.loc[X_train.index[y_train_by_average >= threshold_avg], 'Prediction_Comb_Avg'] = 1

# Mark outliers in test data
final_df.loc[X_test.index[y_test_by_average >= threshold_avg], 'Prediction_Comb_Avg'] = 1
final_df.head()
Out[101]:
DRG Provider_Id Provider_Name Provider_StreetAddress Provider_City Provider_State Provider_Zipcode Hospital_referral_region_desp Total_Discharges Average_Covered_Charges ... Average_Medicare_Bystates_Ratio Average_Medicare_Bystatescity_Ratio Average_Medicare_Byzip_Ratio Average_Total_Bystates_Ratio Average_Total_Bystatescity_Ratio Average_Total_Byzip_Ratio Average_OOP_Bystates_Ratio Average_OOP_Bystatescity_Ratio Average_OOP_Byzip_Ratio Prediction_Comb_Avg
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 32963.07 ... 1.035685 1.038763 1.0 1.005855 1.037810 1.0 0.885921 1.033356 1.0 0
1 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10005 MARSHALL MEDICAL CENTER SOUTH 2505 U S HIGHWAY 431 NORTH BOAZ AL 35957 AL - Birmingham 14 15131.85 ... 1.081989 1.000000 1.0 1.007653 1.000000 1.0 0.708782 1.000000 1.0 0
2 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10006 ELIZA COFFEE MEMORIAL HOSPITAL 205 MARENGO STREET FLORENCE AL 35631 AL - Birmingham 24 37560.37 ... 0.968301 1.000000 1.0 0.946260 1.000000 1.0 0.857643 1.000000 1.0 0
3 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10011 ST VINCENT'S EAST 50 MEDICAL PARK EAST DRIVE BIRMINGHAM AL 35235 AL - Birmingham 25 13998.28 ... 0.897723 0.899200 1.0 0.943232 0.915208 1.0 1.126205 0.970586 1.0 0
4 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10016 SHELBY BAPTIST MEDICAL CENTER 1000 FIRST STREET NORTH ALABASTER AL 35007 AL - Birmingham 18 31633.27 ... 1.054754 1.000000 1.0 0.985152 1.000000 1.0 0.705312 1.000000 1.0 0

5 rows × 24 columns

In [102]:
# Filter out outliers
outliers_comb = final_df[final_df['Prediction_Comb_Avg'] == 1]

# Display the list of outliers
outliers_list_comb = outliers_comb[['DRG', 'Provider_Id', 'Provider_Name','Provider_State', 'Provider_City', 'Provider_Zipcode'] + ratio_columns + ['Prediction_Comb_Avg']]
print("List of Outliers:")
outliers_list_comb.head(5)
List of Outliers:
Out[102]:
DRG Provider_Id Provider_Name Provider_State Provider_City Provider_Zipcode Average_Medicare_byDRG_Ratio Average_Medicare_Bystates_Ratio Average_Medicare_Bystatescity_Ratio Average_Medicare_Byzip_Ratio Average_Total_Bystates_Ratio Average_Total_Bystatescity_Ratio Average_Total_Byzip_Ratio Average_OOP_Bystates_Ratio Average_OOP_Bystatescity_Ratio Average_OOP_Byzip_Ratio Prediction_Comb_Avg
10 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10046 RIVERVIEW REGIONAL MEDICAL CENTER AL GADSDEN 35901 0.808802 0.976949 1.012007 1.0 0.950895 0.992776 1.0 0.846140 0.912300 1.0 1
14 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10083 SOUTH BALDWIN REGIONAL MEDICAL CENTER AL FOLEY 36535 0.789031 0.953069 1.000000 1.0 0.919792 1.000000 1.0 0.786001 1.000000 1.0 1
34 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 30038 SCOTTSDALE HEALTHCARE-OSBORN MEDICAL CENTER AZ SCOTTSDALE 85251 0.985090 0.925614 1.028059 1.0 0.906303 1.042642 1.0 0.822265 1.120502 1.0 1
43 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 30094 ARROWHEAD HOSPITAL AZ GLENDALE 85308 1.127600 1.059519 1.033826 1.0 0.968981 1.009958 1.0 0.574981 0.852184 1.0 1
53 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 40026 ST JOSEPHS MERCY HEALTH CENTER INC AR HOT SPRINGS 71903 0.856703 0.963753 0.716093 1.0 0.948941 0.742293 1.0 0.886955 0.890442 1.0 1
In [103]:
print(f' The Combination by AVG Outlier list contains {len(outliers_list_comb.groupby("Provider_Name").count())} providers who performed {len(outliers_list_comb.groupby("DRG").count())} procedures accross {len(outliers_list_comb.groupby("Provider_State").count())} states', '\n')
 The Combination by AVG Outlier list contains 2470 providers who performed 100 procedures accross 51 states 

4.5.2 Top 10 Outlier by DRG ¶

In [104]:
outliers_list_comb.groupby('DRG').agg(count = pd.NamedAgg(column = 'DRG', aggfunc = 'count')).sort_values(by= 'count', ascending= False)[1:11]
Out[104]:
count
DRG
191 - CHRONIC OBSTRUCTIVE PULMONARY DISEASE W CC 282
192 - CHRONIC OBSTRUCTIVE PULMONARY DISEASE W/O CC/MCC 278
193 - SIMPLE PNEUMONIA & PLEURISY W MCC 252
194 - SIMPLE PNEUMONIA & PLEURISY W CC 212
189 - PULMONARY EDEMA & RESPIRATORY FAILURE 199
177 - RESPIRATORY INFECTIONS & INFLAMMATIONS W MCC 193
065 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W CC 192
069 - TRANSIENT ISCHEMIA 190
178 - RESPIRATORY INFECTIONS & INFLAMMATIONS W CC 180
066 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W/O CC/MCC 178

4.5.3 Outlier by State ¶

In [105]:
outliers_list_comb.groupby('Provider_State').agg(count = pd.NamedAgg(column = 'Provider_State', aggfunc = 'count')).sort_values(by= 'count', ascending= False)[1:11]
Out[105]:
count
Provider_State
CA 674
TX 588
FL 546
NY 429
IL 390
PA 350
OH 336
NC 282
GA 262
MI 250
NJ 245
MO 221
VA 221
TN 200
IN 200
MD 176
AL 176
KY 175
MA 165
SC 162
LA 140
AZ 138
WI 135
MS 125
OK 120
WA 113
AR 110
IA 102
MN 102
CT 101
KS 84
CO 82
WV 78
OR 57
NH 56
NE 53
NV 52
ME 48
NM 44
UT 30
ID 29
RI 28
SD 27
HI 26
MT 24
ND 21
DC 20
AK 18
VT 12
DE 12
WY 10

4.6 Combination by Max ¶

In [106]:
# Combination by max
y_by_maximization = maximization(test_scores_norm)
y_by_maximization
Out[106]:
array([-0.44874417, -0.35827236, -0.61565891, ..., -0.71297992,
       -0.2599202 , -0.37496343])
In [107]:
plot_histogram_with_threshold_new(np.array(y_by_maximization), "Anomaly Scores for Combination by Max", threshold_avg)

Combination by Max Stats¶

In [108]:
auto_max = descriptive_stat_threshold(X_test,y_by_maximization, threshold_avg)
auto_max
Out[108]:
Group Count Count % 0 1 2 3 4 5 6 7 8 9 Anomaly_Score
0 Normal 30997 95.044921 -0.0703 -0.076 -0.0625 -0.0079 -0.0930 -0.0741 -0.0113 -0.0572 -0.0390 -0.0040 -0.1677
1 Outlier 1616 4.955079 1.3489 1.458 1.1988 0.1511 1.7835 1.4205 0.2177 1.0980 0.7479 0.0776 3.2268

Key Takeways

  • All 3 models were tested using different range of hidden layers and ecoding-decoding the data
  • To validate the scores, all models were combined by taking the average and max scores to reduce the chance of overfitting or underfitting the data
  • As shown on the stats table above, both average and max models were able to captured 1,616 outliers with a 3.23 anomaly score compared to -0.17 anomaly score of the normal group, which accounted for 0.5% of the data

Significant Features for identifying outliers based on the provided data are:

  • Avg Med by DRG
  • Avg Med by State
  • Avg Med by State, City
  • Avg Total by State
  • Avg Total by State, City

These features show the largest differences between the normal and outlier groups, indicating they are strong indicators of anomalous behavior in the dataset

Section 5 iForest ¶

5.1 iForest ¶

Isolation Forest is an ensemble-based anomaly detection method that isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. It is based on the fact that anomalies are few and different, thus they are more susceptible to isolation.

Use Cases:

  • Anomaly Detection
  • Preprocessing: Identify and removing anomalies before further data processing or modeling
In [109]:
isft = IForest(contamination=0.05, max_samples=40, behaviour='new') 
isft.fit(X_train)

# Training data
y_train_scores_isft = isft.decision_function(X_train)
y_train_pred_isft = isft.predict(X_train)

# Test data
y_test_scores_isft = isft.decision_function(X_test)
y_test_pred_isft = isft.predict(X_test) # outlier labels (0 or 1)

# Threshold for the defined comtanimation rate
forest_threshold = isft.threshold_
print("The threshold for the defined contamination rate:" , isft.threshold_)
The threshold for the defined contamination rate: -4.8620114688870467e-17
In [110]:
isft.get_params()
Out[110]:
{'behaviour': 'new',
 'bootstrap': False,
 'contamination': 0.05,
 'max_features': 1.0,
 'max_samples': 40,
 'n_estimators': 100,
 'n_jobs': 1,
 'random_state': None,
 'verbose': 0}
In [111]:
isft_vi = isft.feature_importances_
isft_vi
Out[111]:
array([0.13968889, 0.12664838, 0.10514943, 0.05940842, 0.11974233,
       0.11893943, 0.04756913, 0.11662462, 0.11054464, 0.05568474])

Fearture Importance¶

In [112]:
# Define the feature names corresponding to 0-9
feature_names = [
    'Average_Medicare_byDRG_Ratio',
    'Average_Medicare_Bystates_Ratio',
    'Average_Medicare_Bystatescity_Ratio',
    'Average_Medicare_Byzip_Ratio',
    'Average_Total_Bystates_Ratio',
    'Average_Total_Bystatescity_Ratio',
    'Average_Total_Byzip_Ratio',
    'Average_OOP_Bystates_Ratio',
    'Average_OOP_Bystatescity_Ratio',
    'Average_OOP_Byzip_Ratio'
]

# Create a DataFrame for plotting
for_plot = pd.DataFrame({'x_axis': feature_names, 'y_axis': isft_vi}).sort_values(by='y_axis', ascending=True)

# Create a horizontal bar plot using Plotly Express
fig = px.bar(for_plot, x='y_axis', y='x_axis', orientation='h', labels={'x_axis': 'Features', 'y_axis': 'Importance'},
             title='Feature Importances')

# Update layout for better visualization
fig.update_traces(marker_color='navy')
fig.update_layout(template='plotly_white', paper_bgcolor='rgba(0,0,0,0)', plot_bgcolor='rgba(0,0,0,0)')


# Show the plot
fig.show()

iForest Stats¶

In [113]:
forest = descriptive_stat_threshold(X_test,y_test_scores_isft, forest_threshold)
forest
Out[113]:
Group Count Count % 0 1 2 3 4 5 6 7 8 9 Anomaly_Score
0 Normal 31023 95.124644 -0.0899 -0.1045 -0.1027 -0.0338 -0.1119 -0.1106 -0.0408 -0.0336 -0.0289 -0.0078 -0.1998
1 Outlier 1590 4.875356 1.7541 2.0381 2.0043 0.6594 2.1835 2.1571 0.7968 0.6551 0.5637 0.1516 0.0449

5.2 Combination by Average ¶

In [114]:
# Test a range of maximum samples 
k_list = [20, 30, 40, 50, 60]
k_list = [100, 200, 300, 400, 500]
n_clf = len(k_list)
# Just prepare data frames so we can store the model results
train_scores = np.zeros([X_train.shape[0], n_clf])
test_scores = np.zeros([X_test.shape[0], n_clf])

# Modeling
for i in range(n_clf):
    k = k_list[i]
    #isft = IForest(contamination=0.05, max_samples=k) 
    isft = IForest(contamination=0.05, n_estimators=k) 
    isft.fit(X_train)
    
    # Store the results in each column:
    train_scores[:, i] = isft.decision_function(X_train) 
    test_scores[:, i] = isft.decision_function(X_test) 
# Decision scores have to be normalized before combination
train_scores_norm, test_scores_norm = standardizer(train_scores,test_scores)

y_train_by_average_isft = average(train_scores_norm)
y_test_by_average_isft = average(test_scores_norm)
In [115]:
contamination = 0.05
threshold_index = int((1 - contamination) * len(y_train_by_average_isft))
threshold_forest_avg = np.sort(y_train_by_average_isft)[threshold_index]
threshold_forest_avg
Out[115]:
2.263033002370265
In [116]:
plot_histogram_with_threshold_new(np.array(y_test_by_average_isft), "Anomaly Scores for iForest Combination by Avg", threshold_forest_avg)

iForest Combination by Avg Stats¶

In [117]:
forest_avg = descriptive_stat_threshold(X_test,y_test_by_average_isft, threshold_forest_avg)
forest_avg
Out[117]:
Group Count Count % 0 1 2 3 4 5 6 7 8 9 Anomaly_Score
0 Normal 31029 95.143041 -0.0721 -0.0762 -0.0580 -0.0078 -0.0892 -0.0709 -0.0136 -0.0419 -0.0358 -0.0090 -0.1454
1 Outlier 1584 4.856959 1.4120 1.4934 1.1361 0.1523 1.7481 1.3893 0.2663 0.8209 0.7006 0.1763 3.0876

5.2.1 Combination by AVG Outlier List ¶

In [118]:
# Add a column 'Prediction_Comb_Avg' to final_df and initialize with 0 (not outlier)
final_df['Prediction_Forest_Avg'] = 0

# Mark outliers in training data
final_df.loc[X_train.index[y_train_by_average_isft >= threshold_avg], 'Prediction_Forest_Avg'] = 1

# Mark outliers in test data
final_df.loc[X_test.index[y_test_by_average_isft >= threshold_avg], 'Prediction_Forest_Avg'] = 1
final_df.head()
Out[118]:
DRG Provider_Id Provider_Name Provider_StreetAddress Provider_City Provider_State Provider_Zipcode Hospital_referral_region_desp Total_Discharges Average_Covered_Charges ... Average_Medicare_Bystatescity_Ratio Average_Medicare_Byzip_Ratio Average_Total_Bystates_Ratio Average_Total_Bystatescity_Ratio Average_Total_Byzip_Ratio Average_OOP_Bystates_Ratio Average_OOP_Bystatescity_Ratio Average_OOP_Byzip_Ratio Prediction_Comb_Avg Prediction_Forest_Avg
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 32963.07 ... 1.038763 1.0 1.005855 1.037810 1.0 0.885921 1.033356 1.0 0 0
1 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10005 MARSHALL MEDICAL CENTER SOUTH 2505 U S HIGHWAY 431 NORTH BOAZ AL 35957 AL - Birmingham 14 15131.85 ... 1.000000 1.0 1.007653 1.000000 1.0 0.708782 1.000000 1.0 0 0
2 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10006 ELIZA COFFEE MEMORIAL HOSPITAL 205 MARENGO STREET FLORENCE AL 35631 AL - Birmingham 24 37560.37 ... 1.000000 1.0 0.946260 1.000000 1.0 0.857643 1.000000 1.0 0 0
3 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10011 ST VINCENT'S EAST 50 MEDICAL PARK EAST DRIVE BIRMINGHAM AL 35235 AL - Birmingham 25 13998.28 ... 0.899200 1.0 0.943232 0.915208 1.0 1.126205 0.970586 1.0 0 0
4 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10016 SHELBY BAPTIST MEDICAL CENTER 1000 FIRST STREET NORTH ALABASTER AL 35007 AL - Birmingham 18 31633.27 ... 1.000000 1.0 0.985152 1.000000 1.0 0.705312 1.000000 1.0 0 0

5 rows × 25 columns

In [119]:
# Filter out outliers
outliers_forest = final_df[final_df['Prediction_Forest_Avg'] == 1]

# Display the list of outliers
outliers_list_forest = outliers_forest[['DRG', 'Provider_Id', 'Provider_Name','Provider_State', 'Provider_City', 'Provider_Zipcode'] + ratio_columns + ['Prediction_Forest_Avg']]
print("List of Outliers:")
outliers_list_forest.head(5)
List of Outliers:
Out[119]:
DRG Provider_Id Provider_Name Provider_State Provider_City Provider_Zipcode Average_Medicare_byDRG_Ratio Average_Medicare_Bystates_Ratio Average_Medicare_Bystatescity_Ratio Average_Medicare_Byzip_Ratio Average_Total_Bystates_Ratio Average_Total_Bystatescity_Ratio Average_Total_Byzip_Ratio Average_OOP_Bystates_Ratio Average_OOP_Bystatescity_Ratio Average_OOP_Byzip_Ratio Prediction_Forest_Avg
10 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10046 RIVERVIEW REGIONAL MEDICAL CENTER AL GADSDEN 35901 0.808802 0.976949 1.012007 1.0 0.950895 0.992776 1.0 0.846140 0.912300 1.0 1
12 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10056 ST VINCENT'S BIRMINGHAM AL BIRMINGHAM 35205 0.753445 0.910085 0.911582 1.0 0.935761 0.907959 1.0 1.038995 0.895427 1.0 1
14 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10083 SOUTH BALDWIN REGIONAL MEDICAL CENTER AL FOLEY 36535 0.789031 0.953069 1.000000 1.0 0.919792 1.000000 1.0 0.786001 1.000000 1.0 1
34 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 30038 SCOTTSDALE HEALTHCARE-OSBORN MEDICAL CENTER AZ SCOTTSDALE 85251 0.985090 0.925614 1.028059 1.0 0.906303 1.042642 1.0 0.822265 1.120502 1.0 1
43 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 30094 ARROWHEAD HOSPITAL AZ GLENDALE 85308 1.127600 1.059519 1.033826 1.0 0.968981 1.009958 1.0 0.574981 0.852184 1.0 1
In [120]:
print(f' The Combination by iForest AVG Outlier list contains {len(outliers_list_forest.groupby("Provider_Name").count())} providers who performed {len(outliers_list_forest.groupby("DRG").count())} procedures accross {len(outliers_list_forest.groupby("Provider_State").count())} states', '\n')
 The Combination by iForest AVG Outlier list contains 2616 providers who performed 100 procedures accross 51 states 

5.2.2 Top 10 Outliers by DRG¶

In [121]:
outliers_list_forest.groupby('DRG').agg(count = pd.NamedAgg(column = 'DRG', aggfunc = 'count')).sort_values(by= 'count', ascending= False)[1:11]
Out[121]:
count
DRG
191 - CHRONIC OBSTRUCTIVE PULMONARY DISEASE W CC 360
190 - CHRONIC OBSTRUCTIVE PULMONARY DISEASE W MCC 354
193 - SIMPLE PNEUMONIA & PLEURISY W MCC 331
194 - SIMPLE PNEUMONIA & PLEURISY W CC 265
065 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W CC 257
189 - PULMONARY EDEMA & RESPIRATORY FAILURE 254
069 - TRANSIENT ISCHEMIA 248
066 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W/O CC/MCC 243
178 - RESPIRATORY INFECTIONS & INFLAMMATIONS W CC 241
177 - RESPIRATORY INFECTIONS & INFLAMMATIONS W MCC 239

5.2.3 Outliers by State¶

In [122]:
outliers_list_forest.groupby('Provider_State').agg(count = pd.NamedAgg(column = 'Provider_State', aggfunc = 'count')).sort_values(by= 'count', ascending= False)[1:11]
Out[122]:
count
Provider_State
CA 863
TX 734
FL 700
NY 568
IL 511
PA 480
OH 441
NC 358
MI 346
GA 333
NJ 313
VA 307
MO 275
IN 269
TN 249
AL 231
MA 230
KY 229
MD 225
SC 221
AZ 180
WI 179
LA 176
MS 156
WA 155
OK 155
AR 147
MN 139
CT 128
IA 120
CO 115
KS 103
WV 99
NV 79
OR 72
NH 70
NE 68
ME 63
NM 58
UT 44
ID 41
RI 38
HI 36
SD 34
MT 30
ND 28
DC 27
AK 19
VT 17
DE 15
WY 11

Key Takeways

  • As shown on the stats table above, both average and max models were able to captured 1,583 outliers with a 3.06 anomaly score compared to -0.15 anomaly score of the normal group, which accounted for 0.5% of the data

Significant Features for identifying outliers based on the provided data are:

  • Avg Med by DRG
  • Avg Med by State
  • Avg Med by State, City
  • Avg Total by State
  • Avg Total by State, City
  • Avg OOP by State

These features show the largest differences between the normal and outlier groups, indicating they are strong indicators of anomalous behavior in the dataset

Section 6 Model Comparison ¶

In [123]:
# Put the actual, the HBO score and the ECOD score together
Actual_preds = pd.DataFrame({'AutoEcoder_Score': y_test_scores, 
                             'iForest_Score': y_test_scores_isft})

Actual_preds['AutoEncoder_pred'] = np.where(Actual_preds['AutoEcoder_Score']>clf1.threshold_,1,0)
Actual_preds['iForest_pred'] = np.where(Actual_preds['iForest_Score']>isft.threshold_,1,0)
Actual_preds.head()
Out[123]:
AutoEcoder_Score iForest_Score AutoEncoder_pred iForest_pred
0 1.261936 -0.259086 0 0
1 1.462396 -0.226800 0 0
2 0.892089 -0.247386 0 0
3 0.953255 -0.266837 0 0
4 0.921061 -0.246449 0 0
In [124]:
# Compute confusion matrix
cm = pd.crosstab(Actual_preds['AutoEncoder_pred'],Actual_preds['iForest_pred'])

# Plot confusion matrix as heatmap
trace2 = go.Heatmap(z=cm, 
                        x=['Predicted 0', 'Predicted 1'], 
                        y=['True 0', 'True 1'], 
                        showscale=False, 
                        colorscale=[
                            [0.0, "navy"],  
                            [0.005, "Magenta"],  
                            [1.0, "#ffffff"], 
                        ],
                        xgap=20,
                        ygap=20,
                        text=cm,
                        texttemplate="%{text}")

# Define layout
layout = go.Layout(
    title=dict(text="Confusion Matrix between AutoEcoder and iForest", x=0.5, y=0.9, xanchor='center', yanchor='top'),
    xaxis=dict(title='iForest Pred', showticklabels=True),
    yaxis=dict(title='AutoEcoder Pred', showticklabels=True),
    autosize=False,
    width=500,
    height=500,
    plot_bgcolor='rgba(0,0,0,0)',
    paper_bgcolor='rgba(0,0,0,0)'
)

# Plot heatmap
fig = go.Figure(data=[trace2], layout=layout)
fig.show()

Key Takeaways

  • Both AutoEcoder and iForest models were able to identify more than 30k instances as normal along with ~1.2k instances as outliers with some discrepancies between false negatives and false positive.